Is that a dinosaur or a Pokémon? Using deep learning to distinguish fact from fiction

Taxonomy in the animal and Pokémon kingdoms

Binary classification is a relatively simple use of machine learning to differentiate truth from lies, good from bad, yes from no, and… Pokémon from dinosaurs.

Pokémon names can be absurd (exeggutor, kangaskhan, mewtwo), and it’s usually pretty easy for a human to identify them. Once in a while, however, the Pokémon Company throws a curve ball and names its creatures like actual animals, using fake Latin prefixes & suffixes (e.g. bulbasaur, aerodactyl, bastiodon1).

Paleontologists, on the other hand, have a tendency to be super creative when it comes to naming extinct animals, often at the expense of the general public’s understanding and pronunciation ability. They usually use Latin roots, but occasionally draw on the names of people, places, or words from other languages. These names don’t necessarily look any more scientific than a Pokémon name (such as Quetzalcoatlus, Shuvuuia, and Fukuititan). Inspired by some of these sillier names, a few months ago I built a deep learning model using TensorFlow and Keras to generate new dinosaur and extinct animal names.

The Goal

Perhaps there’s more structure in Pokémon and extinct animal names than meets the eye. Would a computer be better at finding some logic behind the different names? Can I train a deep learning model to determine if a word is the name of a Pokémon or of an extinct animal?

The Data

I begin with a list of 1865 genus names of extinct dinosaurs, pterosaurs, mosasaurs, ichthyosaurs, and plesiosaurs from their respective Wikipedia pages, collectively classified as ‘Animals’. I append a list of 809 Pokémon names from pokemondb.net.2

I also use some silhouette images of dinosaurs and other animals from PhyloPic for the awesome graphics.3

The Model

Developers

This project is an adaptation of RStudio’s Keras text classification tutorial for identifying positive and negative movie reviews on IMDB. I also repurposed some functions from my previous project to generate dinosaur names. Those functions were originally written by Jacqueline Nolis for her project generating offensive license plates from a data set of banned plates in Arizona.

Data cleaning

I start by setting aside 20% of the animal and Pokémon names as a test set4 (observations I don’t touch until the very end), leaving me with 2141 names to train the model. Before these names can be used with deep learning algorithms, they must be converted to a format the models can understand.

The model expects a numeric matrix as input, so I first pad the beginning of each name with *s until it is 20 characters long. I then split each padded name into an array of 20 characters and replace each character with a unique numeric value: each ‘*’ becomes a 0, ‘a’ becomes 1, and so on. This converts the x values into a matrix of 2141 rows and 20 columns.

#Processing inputs
max_length <- 20
characters <- c("*", letters)

"bulbasaur" %>% 
  pad_data(max_length) %T>%
  print() %>% 
  vectorize(characters, max_length)
## [1] "***********bulbasaur"
##      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13]
## [1,]    0    0    0    0    0    0    0    0    0     0     0     2    21
##      [,14] [,15] [,16] [,17] [,18] [,19] [,20]
## [1,]    12     2     1    19     1    21    18

This is the only input data: names are classified based on their characters alone.

The y data (animal or Pokémon) is binary - 0 for animal and 1 for Pokémon.


library(tidyverse)
library(tokenizers)
library(keras)

set.seed(12334)

#Consolidate data  ----

animals <- readRDS("data/list_of_extinct_reptiles.RDS") %>%
  str_replace_all("[^[:alnum:] ]", "") %>% # remove any special characters
  tolower

pokemon <- readRDS("data/list_of_pokemon.RDS") %>%
  iconv(from="UTF-8",to="ASCII//TRANSLIT") %>% 
  str_replace_all("[^[A-Za-z] ]", "") %>% # remove any special characters
  tolower %>% 
  unique()

full_data <- tibble(Category = "Animal", Name = animals) %>% 
  bind_rows(tibble(Category = "Pokemon", Name = pokemon)) %>% 
  arrange(Name) %>% 
  distinct()

train_data <- full_data %>% 
  sample_frac(0.8)

test_data <- full_data %>% 
  anti_join(train_data)

max_length <- 20

#Functions to process data ----
pad_data <- function(dat, max_length){
  dat %>% 
    map_chr(function(x, max_length){
      y <- str_c(paste0(rep("*", max(0, max_length - nchar(x))), collapse=""), x)
      return(y)
    }, max_length)
}
vectorize <- function(dat, characters, max_length){
  x <- array(0, dim = c(length(dat), max_length))
  
  for(i in 1:length(dat)){
    for(j in 1:(max_length)){
      x[i,j] <- which(characters==substr(dat[[i]], j, j)) - 1
    }
  }
  x
}

x_train <- train_data$Name %>% 
  pad_data(max_length)

characters <- x_train %>% 
  tokenize_characters(strip_non_alphanum = FALSE) %>% 
  flatten() %>% 
  unlist() %>% 
  unique() %>% 
  sort()

x_train_v <- x_train %>% vectorize(characters, max_length)

y_train <- as.numeric(train_data$Category == "Pokemon")

Results

For each processed name (a row in the matrix), the model outputs a value between 0 and 1: values closer to 0 indicate the name is more likely that of an extinct animal, and values closer to 1 indicate it is more likely a Pokémon name.


vocab_size <- length(characters)

#This model specification is borrowed heavily from the RStudio example
# https://keras.rstudio.com/articles/tutorial_basic_text_classification.html

model <- keras_model_sequential()
model %>% 
  layer_embedding(input_dim = vocab_size, output_dim = 16) %>%
  layer_global_average_pooling_1d() %>%
  layer_dense(units = 16, activation = "relu") %>%
  layer_dense(units = 1, activation = "sigmoid")

# The first layer is an embedding layer. This layer takes the integer-encoded vocabulary and looks up the embedding vector for each character index. These vectors are learned as the model trains. The vectors add a dimension to the output array. The resulting dimensions are: (batch, sequence, embedding).
# Next, a global_average_pooling_1d layer returns a fixed-length output vector for each example by averaging over the sequence dimension. This allows the model to handle input of variable length, in the simplest way possible.
# This fixed-length output vector is piped through a fully-connected (dense) layer with 16 hidden units.
# The last layer is densely connected with a single output node. Using the sigmoid activation function, this value is a float between 0 and 1, representing a probability, or confidence level.

model %>% compile(
  optimizer = 'adam',
  loss = 'binary_crossentropy',
  metrics = list('accuracy')
)

#Validation & training
#Prior to modeling, I further split the training data: 250 names are reserved for validation and the rest are used to fit the model.
validation_size <- sample(1:nrow(x_train_v), 250) # row indices of the 250 validation names
partial_x_train <- x_train_v[-validation_size, ]
partial_y_train <- y_train[-validation_size]

x_val <- x_train_v[validation_size, ]
y_val <- y_train[validation_size]

model %>% fit(
  partial_x_train,
  partial_y_train,
  epochs = 80,
  batch_size = 512,
  validation_data = list(x_val, y_val),
  verbose=1
)
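
Once the model is fit, getting its output for any name takes just a couple of lines. A quick sketch (the exact probabilities depend on the fitted weights):

#Sketch: model output for a couple of example names (values near 1 = Pokemon)
example_matrix <- c("tyrannosaurus", "pikachu") %>% 
  pad_data(max_length) %>% 
  vectorize(characters, max_length)

predict(model, example_matrix)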

Validation

How does the model perform on the 533 names in the test data set? Rounding the results to 0 or 1, the model correctly classifies 93.3% of the extinct animal names but only 85.5% of the Pokémon names.5

Table 1: Animal / Pokémon classification accuracy: Test data

Actual / Predicted   Animal         Pokémon
Animal               349 (93.3%)    25 (6.7%)
Pokémon              23 (14.5%)     136 (85.5%)
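
Roughly, the evaluation behind Table 1 looks like this (a sketch; it assumes the objects from the scripts above are still in memory):

#Sketch: evaluate the fitted model on the held-out test set
x_test_v <- test_data$Name %>% 
  pad_data(max_length) %>% 
  vectorize(characters, max_length)

y_test <- as.numeric(test_data$Category == "Pokemon")

#Round the predicted probabilities to 0 (Animal) or 1 (Pokemon)
test_pred <- round(as.numeric(predict(model, x_test_v)))

table(Actual    = ifelse(y_test == 1, "Pokemon", "Animal"),
      Predicted = ifelse(test_pred == 1, "Pokemon", "Animal"))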

Meta-validation

As mentioned above, I had previously used Keras to develop a model that generates new names for dinosaurs and other extinct animals. How well does this new classifier handle names that were designed to look like animal names?6 For completeness, I also reran that generator on the Pokémon names to produce brand-new Pokémon-style names7.
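
Scoring the generated names works just like the test set. A rough sketch, where generated_animal_names and generated_pokemon_names stand in for the character vectors produced by those earlier models:

#Sketch: classify names produced by the earlier name-generation models
#(generated_animal_names & generated_pokemon_names are placeholders)
generated_data <- tibble(Category = "Animal", Name = generated_animal_names) %>% 
  bind_rows(tibble(Category = "Pokemon", Name = generated_pokemon_names))

x_gen_v <- generated_data$Name %>% 
  pad_data(max_length) %>% 
  vectorize(characters, max_length)

gen_pred <- round(as.numeric(predict(model, x_gen_v)))

table(Generated = generated_data$Category,
      Predicted = ifelse(gen_pred == 1, "Pokemon", "Animal"))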

Table 2: Animal / Pokémon classification accuracy: Generated data

Generated / Predicted   Animal         Pokémon
Animal                  260 (91.9%)    23 (8.1%)
Pokémon                 30 (15.6%)     162 (84.4%)

The error rate is more or less the same as on the test data, which is great! …I think 🤔

Pattern validation

Finally, I want to see how sensitive the model is to common prefixes and suffixes in animal and Pokémon names. I take six animal and Pokémon names, split each into a prefix and a suffix, and recombine them to get 36 real and fake names. I then run those names through the model to see if there is a pattern in the predictions.
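
Building and scoring the 36 combinations looks roughly like this (a sketch reusing the helper functions from above):

#Sketch: score every prefix + suffix combination used in Table 3
prefixes <- c("allo", "tyrani", "ptero", "electa", "diplo", "pika")
suffixes <- c("saurus", "tar", "dactylus", "buzz", "docus", "chu")

combos <- expand.grid(prefix = prefixes, suffix = suffixes,
                      stringsAsFactors = FALSE) %>% 
  mutate(name = paste0(prefix, suffix))

x_combo_v <- combos$name %>% 
  pad_data(max_length) %>% 
  vectorize(characters, max_length)

#Probability that each combination is a Pokemon (values near 0 = Animal)
combos$prob_pokemon <- as.numeric(predict(model, x_combo_v))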

Table 3: Model probabilities of prefix & suffix combinations

Prefix / Suffix   -saurus        -tar           -dactylus      -buzz          -docus         -chu
Allo-             Animal 87.5%   Pokémon 83.4%  Animal 95.0%   Pokémon 88.5%  Animal 60.3%   Pokémon 85.1%
Tyrani-           Animal 98.1%   Animal 68.6%   Animal 99.0%   Animal 57.9%   Animal 94.3%   Animal 65.9%
Ptero-             Animal 95.4%   Pokémon 62.6%  Animal 97.8%   Pokémon 72.7%  Animal 82.0%   Pokémon 65.5%
Electa-           Animal 96.3%   Pokémon 56.7%  Animal 98.2%   Pokémon 67.6%  Animal 85.3%   Pokémon 59.8%
Diplo-            Animal 95.0%   Pokémon 64.7%  Animal 97.8%   Pokémon 74.4%  Animal 80.7%   Pokémon 67.5%
Pika-             Animal 85.6%   Pokémon 85.5%  Animal 94.1%   Pokémon 89.4%  Animal 56.4%   Pokémon 86.9%

Note: The actual animal & Pokémon names run down the diagonal. Green: name correctly identified; red: name incorrectly identified.

The model’s prediction appears to be determined almost entirely by the suffix. The one exception is the prefix ‘Tyrani-’, which skews the results heavily toward the animal classification regardless of suffix (there are 11 animals with ‘tyran’ in the data).

Conclusion

The model’s error rates aren’t great, but considering that both extinct animal and Pokémon names are invented by humans, and that some Pokémon names are inspired by actual animal names, the model performed better than I had expected. More importantly, it was a great opportunity to make this ggplot:


Try it out! The full script can be found on GitHub!


  1. Bastiodon is a Pokémon, but if it were a Latin name, it would mean something like ‘construction tooth’. Fitting since it looks like a toothy bulldozer-triceratops.


  2. I’m curious whether the uneven sample sizes for animals & Pokémon affect the strength of the model, especially since the results seem to be skewed toward classifying words as animals. A quick Google search suggests that it’s not a problem, but I will investigate this more at a later time.

  3. PhyloPic is raising money for a major overhaul. Donate here!

  4. Look at me being a proper data scientist.😏

  5. Which names did the model get wrong? I’m not sure a human would have done much better. Look at some of those animal names! @paleontologists WTF?!

    Animal       Pokémon
    apatodon     aerodactyl
    avalonia     alomomola
    avimimus     cresselia
    azhdarcho    dragonite
    banguela     druddigon
    camelotia    escavalier
    citipati     fletchinder
    deinodon     gothorita
    dracorex     hitmontop
    elaltitan    kricketune
    eolambia     misdreavus
    feilongus    ninetales
    gastonia     palpitoad
    glishades    pheromosa
    gryponyx     poochyena
    gualicho     roggenrola
    itemirus     shroomish
    koparion     sudowoodo
    kryptops     togedemaru
    nomingia     toucannon
    shuvuuia     tyrantrum
    tapejara     vanilluxe
    thililua     volcarona
    velafrons
    zapsalis
  6. I don’t think this is proper data science. 🤷

  7. Some of these generated Pokémon names are awesome. Email me, Nintendo.
