Generating dinosaur names using deep learning

Deep learning with Keras

RStudio just released a new package and book for neural networks and machine learning in R. I’m excited to learn all of the finer details of TensorFlow and Keras, but I’m also a bit impatient.

For better or worse, I decided to jump (mostly) blindfolded into the package and see what I could produce.

The Goal

If you look around the rest of this site, you’ll notice that my brand is dinosaurs (and divas). So… Can I train a deep learning model to develop new dinosaur names?

The Data

I trained a model using 1865 genus names of dinosaurs, pterosaurs, mosasaurs, ichthyosaurs, and plesiosaurs. The latter 4 are not dinosaurs, but I wanted to increase my training data set to include as many extinct “saurs” as I could find.

I also used 217 silhouette images of dinosaurs and other animals from PhyloPic. I had collected these for Datasaurs, a Twitter bot finding correlations between dinosaur shapes and US cause of death time series.[1]

The Model

Developers

I had it easy for this project. Jonathan Nolis published his project to generate offensive license plates that would be banned in Arizona. He based his work on the Keras example that creates new text in the style of Nietzsche. Jonathan figured that predicting letters from license plates isn’t much different from predicting words from tomes. Generating new dinosaur names was the logical next step, of course.

An Explanation

The links I embedded above lay out the logic and inner workings of the model in much more detail than I will here (as I said, I’m still a beginner with deep learning and neural networks). The algorithm commonly used for text prediction is the recurrent neural network (RNN). Given a string of characters (or words), the algorithm chooses the next character (or word) based on the distribution of those sequences in the training data.

In this case, the training data is the list of 1865 extinct animal names. As you can imagine, the sequence “…saur…” shows up many times in that list.[2]
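To make "the distribution of those sequences" concrete, here is a minimal base R sketch that turns a few names into (prefix, next character) pairs and counts what follows a prefix ending in "saur". The three names, the "+" stop marker, and the make_pairs() helper are assumptions for illustration; the actual model learns these patterns inside the network rather than from a lookup table.

```r
# Hypothetical mini training set; the real model used 1865 genus names.
dino_names <- c("tyrannosaurus", "stegosaurus", "velociraptor")

# "+" is an arbitrary stop marker chosen for this sketch.
stop_char <- "+"

# For one name, list every prefix together with the character that follows it.
make_pairs <- function(name) {
  chars <- c(strsplit(name, "")[[1]], stop_char)
  data.frame(
    prefix = vapply(seq_along(chars),
                    function(i) paste(chars[seq_len(i - 1)], collapse = ""),
                    character(1)),
    next_char = chars,
    stringsAsFactors = FALSE
  )
}

pairs <- do.call(rbind, lapply(dino_names, make_pairs))

# Empirical distribution of the character that follows "...saur"
after_saur <- pairs$next_char[grepl("saur$", pairs$prefix)]
prop.table(table(after_saur))
```

With a training list this small, every "saur" is followed by "u"; the full list of names would give a richer distribution.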

When the model is generating a new name, it has an intermediate sequence of letters that begin the name and needs to add a character. Assume the model is currently at saur* and is about to pick a new character. There are 27 possible options for the model to choose from: the 26 letters of the alphabet and the option for the model to stop adding letters. However, those 27 options are not equally likely - the probability of each one comes from the training data.

Below is a simplified illustration of the saur* example. Rather than 27 options for a next character, assume there are only 4 possible letters and a stop character. Selecting “u” as the next character is the most likely outcome, followed by “a” or “(stop)”.

Probability of letters after ‘saur*’

Next Letter   Probability
u             0.45
a             0.15
i             0.12
o             0.12
(stop)        0.15

From there, the chain will continue. If the model samples the “u” from the selection of characters, then it’s probable that the next character after that would be an “s” and then stop after that - yielding “saurus.” If the model samples the “a” and then stops, the result would be “saura.”
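A single sampling step can be sketched in base R with sample(). The probabilities are the simplified ones from the saur* illustration, and the seed and number of draws are arbitrary choices. Because the rounded values sum to 0.99, this relies on sample() treating prob as weights that it rescales internally.

```r
set.seed(42)  # arbitrary seed so the draws are repeatable

# Simplified next-character probabilities after "saur*"
next_probs <- c(u = 0.45, a = 0.15, i = 0.12, o = 0.12, "(stop)" = 0.15)

# One draw is one step of name generation
next_char <- sample(names(next_probs), size = 1, prob = next_probs)

# Many draws recover (approximately) the probabilities we started with
draws <- sample(names(next_probs), size = 10000, replace = TRUE, prob = next_probs)
round(prop.table(table(draws))[names(next_probs)], 2)
```

In the real model, the probabilities would be recomputed after every character, since they depend on everything generated so far.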

Changing the Temperature

One of the hyperparameters in the model is temperature. A value of 1 means that each character is sampled with the default probabilities (e.g. 45% chance for “u” after “saur*”). Decreasing this value below 1 increases the probabilities of the more likely characters, creating more conservative predictions. Temperature values above 1 work the opposite way, moving all probabilities closer to each other.

Probability of letters after ‘saur*’ with varying temperature

Next Letter   T=0.25   T=0.5   T=1    T=2    T=4    T=10
u             0.97     0.73    0.45   0.31   0.25   0.22
a             0.01     0.08    0.15   0.18   0.19   0.20
i             0.01     0.05    0.12   0.16   0.18   0.19
o             0.01     0.05    0.12   0.16   0.18   0.19
(stop)        0.01     0.08    0.15   0.18   0.19   0.20

In the example above, as the temperature approaches 0, the probability of the model selecting “u” as the next character climbs closer to 100%. On the other side of T=1, as the temperature approaches infinity, each option becomes equally likely (the prediction is a random draw). In the example with 5 possible characters, each has a 20% chance of being selected. In the full model with 27 characters, each has a 3.7% (1/27) chance.
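The rescaling can be sketched in a few lines of base R. This assumes the usual approach from the Keras text-generation example (divide the log probabilities by the temperature, exponentiate, and renormalize); apply_temperature() is a name made up for this sketch, and small differences from the table are possible because the inputs here are the rounded T=1 values.

```r
# Rescale a probability vector by a temperature (hypothetical helper).
apply_temperature <- function(probs, temperature) {
  scaled <- exp(log(probs) / temperature)
  scaled / sum(scaled)  # renormalize so the result sums to 1
}

# Rounded T = 1 probabilities from the simplified saur* example
next_probs <- c(u = 0.45, a = 0.15, i = 0.12, o = 0.12, "(stop)" = 0.15)

temps <- c(0.25, 0.5, 1, 2, 4, 10)
temp_table <- sapply(temps, function(t) apply_temperature(next_probs, t))
colnames(temp_table) <- paste0("T=", temps)
round(temp_table, 2)
```

Low temperatures pile almost all of the probability onto "u", while high temperatures flatten the column toward 1/5 for each option.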

For the results in this post[3], I sampled 500 new names from the model with temperatures varying from 0.5 to 1.5. Of the resulting names, 460 were unique, and 283 of those are not the names of existing dinosaurs. Other runs of the model with lower (more conservative) temperature values yielded more duplicates and more actual dinosaur names.
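Counting unique and genuinely new names is a short set operation in base R. The vectors below are tiny hypothetical stand-ins, and the sketch assumes that "new" means "not in the training list".

```r
# Hypothetical stand-ins for the sampled names and the training names
generated <- c("ponysaurus", "tyrannosaurus", "ponysaurus",
               "megalodonyx", "stegosaurus")
training_names <- c("tyrannosaurus", "stegosaurus", "velociraptor")

unique_names <- unique(generated)                     # drop duplicate samples
novel_names  <- setdiff(unique_names, training_names) # drop real names

length(unique_names)
length(novel_names)
novel_names
```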

Phylogenetic tree

I accomplished my goal (and learned something!): I tuned a deep learning model to come up with a handful of mostly believable dinosaur names. Now the fun part - creating nonsense phylogenetic trees to display the nonsense dinosaurs.

To add some legitimacy to this project, I used it as an opportunity to practice my functional programming. In short, each tree is developed by:

  • Sampling N model-generated dinosaur names (16 is a good number).
  • Using the adist() function in base R to compare the generated names with the names of actual animals in the PhyloPic database, then selecting an image of an animal with a similar name to represent each generated dinosaur graphically.
  • Assigning each of the N dinosaurs an x-position and y-position.
  • Creating a tree linking each dinosaur with one to the left of it.
  • Plotting with geom_path() + geom_raster() + geom_label().
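The name-matching step can be sketched with adist(), which returns a matrix of edit distances between two character vectors. The names below are hypothetical, and picking the single closest match with which.min() is an assumption; the project's own code may choose among several similar names.

```r
# Hypothetical generated names and PhyloPic animal names
generated      <- c("ponysaurus", "velocitops")
phylopic_names <- c("Stegosaurus", "Triceratops", "Velociraptor")

# Rows are generated names, columns are PhyloPic names
dists <- adist(generated, phylopic_names, ignore.case = TRUE)

# For each generated name, take the PhyloPic name with the smallest distance
closest <- phylopic_names[apply(dists, 1, which.min)]
data.frame(generated, closest)
```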

Check out the code on GitHub to run your own trees!

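library(magrittr) # provides the %>% pipe (assumed; any package that exports it works)

n_kerasaurs <- 16      # number of generated names per tree
phylo_resolution <- 32 # resolution setting passed to the tree helpers

# Assumed to exist already:
#   kerasaurs: full list of model-generated names
#   phylo:     list of PhyloPic images
# find_similar_phylopic(), create_tree_data(), and generate_tree()
# are this project's own helpers (see the code on GitHub).

kerasaurs %>%
  find_similar_phylopic(phylo) %>%
  create_tree_data(n_kerasaurs, phylo_resolution) %>%
  generate_tree(phylo_resolution)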

Running this pipeline any number of times produces images like the ones below:


  1. One of the greatest scientific discoveries of the 21st century.

  2. Exactly 834 of the 1865 training animals contain the string “saur” in the name. Below are some other common character combinations in the training data.

    Pattern Freq
    saur 834
    odon 84
    tops 68
    ylus 55
    raptor 47
    titan 34
    mimus 19
  3. I knew the model was tuned correctly when it produced “ponysaurus”. Apparently this is also the name of a brewery in North Carolina.

    ponysaurus
