Using data science tools to understand museum visitor engagement


The Yale Peabody Museum was designed and constructed in the 1920s, a time when the best way visitors could share their experience with others was a postcard from the gift shop [1]. In the 100 years since then, technology has evolved and visitors have changed how they engage with the museum and with each other; smartphones, photos, and selfies are now a common feature of the modern museum visit.

Though the exhibits are understandably slow to adapt to ever-changing technology [2], museums can update their public program offerings to react to changing visitor behavior. Museums should understand how visitors engage (or don’t engage) with the exhibits and specimens throughout the halls and curate their programs to better relate to guests. In turn, visitors will have a more positive museum experience, which can translate into return trips, increased revenue, and new guests.

In order to update museum offerings to reflect the changing habits of visitors, it’s imperative to understand those habits. Many visitors share their experiences on social media platforms. These social media posts capture how visitors interact with museum exhibits through their phones, providing insights into what visitors do when they aren’t reading signage or engaging with museum staff and volunteers. With a better understanding of how guests interact inside the walls, museums can improve events, exhibits, and marketing to further encourage this behavior. After all, each positive social media post at the museum is a free advertisement for the Peabody.

Note: I’m not a Peabody Museum employee, just a fan.

The Goal

Using public Instagram posts tagged at the museum, can we understand how visitors engage with Yale Peabody Museum exhibits through their smartphones and social media?

The Data

I look at public Instagram posts tagged at the Yale Peabody Museum. Each Instagram post comes with four key pieces of information about the guest’s experience that we can leverage:

  1. Image
  2. Date & time of the post
  3. Caption / description
  4. Hashtags

For this post, I will primarily use the content of the images in each of these Instagram posts [3]. One way to do this is to visually inspect each Instagram image and manually take notes. However, with 3,257 images tagged at the Peabody, human review is not practical. Instead, for this analysis, I take advantage of another area of rapidly evolving technology: artificial intelligence. Google Cloud Vision is a service that can examine a photo and label the objects it identifies in it.

Below is an example of the Cloud Vision output for an image of Deinonychus in the foreground of the Age of Reptiles mural in the Great Hall.

Google Cloud Vision Example

The AI recognizes with the most confidence that it is looking at an image of dinosaurs, perhaps T. rex and Velociraptor. The tool also identifies a tree, which will be valuable later for differentiating between fossil dinosaurs and painted dinosaurs in the Age of Reptiles mural.

Using the Cloud Vision API and tools in R, I retrieved image labels for each of the Instagram posts.
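For readers curious about what that step involves, here is a minimal sketch in Python (not the R code used for this analysis). It builds the JSON body for a label-detection request to the Cloud Vision REST endpoint `images:annotate` and pulls the label descriptions and scores out of a response. The helper names and the sample response values are made up for illustration, and no network call is made.

```python
import base64


def build_label_request(image_bytes, max_results=10):
    """Build the JSON body for a Cloud Vision `images:annotate` label request."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return {
        "requests": [
            {
                "image": {"content": encoded},
                "features": [{"type": "LABEL_DETECTION", "maxResults": max_results}],
            }
        ]
    }


def parse_labels(response_json):
    """Return (label, score) pairs from the first response, labels lowercased."""
    annotations = response_json["responses"][0].get("labelAnnotations", [])
    return [(a["description"].lower(), a["score"]) for a in annotations]


# Hypothetical response: shaped like the API's JSON, but the values are invented.
sample_response = {
    "responses": [
        {
            "labelAnnotations": [
                {"description": "Dinosaur", "score": 0.98},
                {"description": "Tyrannosaurus", "score": 0.91},
                {"description": "Tree", "score": 0.72},
            ]
        }
    ]
}

request_body = build_label_request(b"abc")  # placeholder bytes, not a real image
labels = parse_labels(sample_response)
```

An image with no recognizable content simply comes back without a `labelAnnotations` field, which is why the parser falls back to an empty list.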

Selection bias

We need to be cautious about using the observations of Instagram posts to make conclusions about museum experience for all visitors. Only a small subset of museum guests use Instagram: teens, young adults, and parents are probably more inclined to use the platform than school children and older adults. Beyond that, not all users select the Peabody Museum location tag for their posts, meaning I have not collected every Instagram post from the museum [4].

Conversely, some Instagram users post photos listing the Peabody Museum as the location, though the photo was not taken at or around the museum (there are a few super strange photos from this subset of people). For the most part, this error is small. We can use some of the tools later in the analysis to identify these images.

The goal is to use the visitors in the transparent yellow / green box to understand the behavior of the visitors in the light blue box, the museum guests who interact with the exhibits using their smartphones and social media. We won’t have enough information to understand the behavior of visitors in the darker blue box.

The Analysis

Most common labels

Google Cloud Vision AI identified between 5 and 10 labels for each of the photos. A cursory overview of the labels shows that the AI can identify popular dinosaurs, animals, and minerals, but for the most part, the image labels are more general descriptors.

Below are the most common labels across all images.

This table alone immediately provides us with some information about how visitors are interacting with the museum on social media. ‘Dinosaur’, ‘mineral’, and ‘fauna’, which describe the three main exhibits at the Peabody Museum, are some of the most common image labels. ‘Sculpture’ and ‘statue’ refer to the Torosaurus statue and other smaller exhibits, and ‘Tyrannosaurus’ and ‘tree’ refer to the Age of Reptiles mural. The tag ‘fun’ shows up in 5% of images and, upon visual inspection, is associated with smiling people and events.
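The counting behind a table like this is simple. As a rough sketch in Python (not the original R code), each label is counted once per image and divided by the number of images. The four posts and their labels below are invented for illustration.

```python
from collections import Counter

# Hypothetical labels for four posts; the real data set has 3,257 posts.
image_labels = {
    "post_a": ["dinosaur", "museum", "tree"],
    "post_b": ["mineral", "crystal"],
    "post_c": ["dinosaur", "fun", "smile"],
    "post_d": ["dinosaur", "sculpture", "sky"],
}


def label_shares(image_labels):
    """Return (label, share of images) pairs, most common label first."""
    counts = Counter()
    for labels in image_labels.values():
        counts.update(set(labels))  # count each label at most once per image
    n_images = len(image_labels)
    return [(label, count / n_images) for label, count in counts.most_common()]


shares = label_shares(image_labels)
```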

Image grouping

The Cloud Vision AI identified 1,322 distinct labels across all images, meaning we have over a thousand ways to describe each of the 3,257 Instagram posts. To interpret this data beyond simple label counting, we can use some more sophisticated data science tools.


The first tool is a technique called “clustering”, which looks at the labels associated with each photo and groups similar Instagram posts together. A simple example of clustering is a map of the world, with two dimensions: longitude and latitude. The U.S., Canada, and Mexico are clustered in ‘North America’ while Italy, Germany, and France are clustered in ‘Europe’.

Cluster analysis of the Instagram posts at the Peabody Museum works the same way, but with 1,322 dimensions instead of two. This analysis is able to separate the Instagram posts into 10 distinct groups. From there, I examine the labels associated with each group and translate them into categories relevant to the museum. For example, one group contains the labels crystal, emerald, jewellery, mineral, and quartz, which likely describe photos taken in the David Friend Hall. Another group has the labels dinosaur, museum, tourist attraction, tyrannosaurus, and velociraptor, suggesting these photos were taken in the Great Hall [5].
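To make the idea concrete, here is a small sketch in Python. This post does not say which clustering algorithm or distance measure was used, so k-means with Euclidean distance on 0/1 label vectors is an assumption, as are the four toy posts and every function name. Each post becomes a vector with one dimension per label, and posts whose vectors sit close together end up in the same group.

```python
def to_vectors(image_labels):
    """Turn each post's labels into a 0/1 vector with one dimension per label."""
    vocab = sorted({label for labels in image_labels.values() for label in labels})
    index = {label: i for i, label in enumerate(vocab)}
    vectors = {}
    for name, labels in image_labels.items():
        vector = [0.0] * len(vocab)
        for label in labels:
            vector[index[label]] = 1.0
        vectors[name] = vector
    return vocab, vectors


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(vectors, k, iterations=20):
    """Plain k-means with a deterministic farthest-point start (an assumption)."""
    names = sorted(vectors)
    centers = [vectors[names[0]][:]]
    while len(centers) < k:
        # Next center: the post farthest from its nearest existing center.
        farthest = max(names, key=lambda n: min(sq_dist(vectors[n], c) for c in centers))
        centers.append(vectors[farthest][:])
    assignment = {}
    for _ in range(iterations):
        for n in names:
            assignment[n] = min(range(k), key=lambda c: sq_dist(vectors[n], centers[c]))
        for c in range(k):
            members = [vectors[n] for n in names if assignment[n] == c]
            if members:  # keep the old center if a cluster ends up empty
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment


# Hypothetical posts: two that look like the Great Hall, two like the mineral hall.
posts = {
    "dino_1": ["dinosaur", "museum", "tyrannosaurus"],
    "dino_2": ["dinosaur", "museum", "velociraptor"],
    "mineral_1": ["crystal", "mineral", "quartz"],
    "mineral_2": ["crystal", "mineral", "emerald"],
}
vocab, vectors = to_vectors(posts)
clusters = kmeans(vectors, k=2)
```

The algorithm only returns group numbers. Turning a group into a name like ‘Great Hall’ is still a human step.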

The human curation of the labels is important to add context to the cluster results. A person familiar with the layout of the Peabody Museum would immediately be able to recognize the importance of most of these image labels and assign them to different exhibits or locations in the museum. The table below lists the clusters as I’ve named them, along with the percent share of Instagram posts that belong to each one.

Table 1: Peabody Museum Instagram post clusters
Cluster              Share
Exhibits & other     48%
Great Hall           11%
Mineral hall         11%
People               9%
Torosaurus Statue    7%
Outdoors             5%
Architecture         4%
Birds                3%
B&W photography      3%
Flowers              1%

Clustering places each photo into a distinct group, though in reality, photos contain multiple subjects that could fit into multiple clusters. In this analysis, many of those multi-subject images end up in the largest cluster I call ‘Exhibits & other’. In addition to this, three other clusters specifically refer to exhibits: ‘Great Hall’, ‘Mineral Hall’, and ‘Birds’. Another group of images with the tags ‘sculpture’, ‘statue’, ‘dinosaur’, and ‘sky’ refers to the Torosaurus statue outside the museum.

Overall, this cluster analysis gives us a much better idea of the content of museum visitors’ social media posts, but it can also oversimplify the story of how the visitors engage with exhibits.

Principal Component Analysis

To further understand the content of Instagram posts, we can use a different data science tool called Principal Component Analysis (PCA) [6]. Rather than put each image into a distinct cluster, PCA breaks down the features of the photos, identifying the combinations of labels that best explain the differences between Instagram posts [7]. For each principal component, the analysis returns a score for every image label. The sign of the score, below or above zero, shows which side of the component the label falls on, and its magnitude reflects the label’s importance for that principal component.

For example, the first principal component finds that the labels ‘glasses’, ‘face’, and ‘smile’ contrast with the labels ‘dinosaur’, ‘architecture’, and ‘building’. This tells us that a key defining feature of Instagram photos at the Peabody Museum is Selfies & people vs. Objects. In other words, a great way to sort social media posts is by selfies & people or objects.

Next, PCA identifies contrasting image labels as Natural vs. Man-made, and then Dinosaurs vs. Insects & flowers. In total, I’ve identified 15 key factors in classifying Instagram posts taken at the Peabody Museum using principal component analysis. This analysis allows us to now assess visitor interaction with 15 descriptions, rather than looking at each of the 1,322 labels produced by the Cloud Vision AI.

Like This…             …or This?
Selfies & people       Objects
Natural                Man-made
Dinosaurs              Insects & flowers
Flowers                Insects
Minerals               Birds & flowers
Flowers & dinosaurs    Birds
Stone carving          Nature & buildings
Egypt & stone          Handwritten
Birds                  Bears
Food                   Artsy
Outside                Inside
Events                 Non-events
Bears                  Mountain sheep
Dogs                   Cats
Dogs & cats            Reptiles & bears

Think of this as a Peabody Museum version of the game Twenty Questions (but actually only 15 questions). Is an Instagram photo more like the left option, right option, or neither? Repeat this for each of the components to get a clear understanding of the contents of the photo.
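For the curious, the first of these “questions” can be sketched in a few lines of Python. This is not the R workflow behind this post. It is a toy version that finds only the first principal component by power iteration, on four invented posts, and the function name and data are assumptions. Labels whose scores have opposite signs sit on opposite sides of the question.

```python
import math


def first_component(rows, iterations=100):
    """Return (label scores, post scores) for the first principal component."""
    n_cols = len(rows[0])
    means = [sum(col) / len(rows) for col in zip(*rows)]
    centered = [[x - m for x, m in zip(row, means)] for row in rows]
    # Deterministic, non-uniform start so it is not orthogonal to the answer here.
    v = [1.0 / (i + 1) for i in range(n_cols)]
    for _ in range(iterations):
        # Multiply by X^T X without building the covariance matrix.
        proj = [sum(x * w for x, w in zip(row, v)) for row in centered]
        v = [sum(p * row[j] for p, row in zip(proj, centered)) for j in range(n_cols)]
        norm = math.sqrt(sum(x * x for x in v))
        v = [x / norm for x in v]
    post_scores = [sum(x * w for x, w in zip(row, v)) for row in centered]
    return v, post_scores


# Hypothetical posts: two of people, two of objects.
vocab = ["architecture", "building", "dinosaur", "face", "glasses", "smile"]
posts = {
    "people_1": ["face", "glasses", "smile"],
    "people_2": ["face", "smile"],
    "object_1": ["building", "dinosaur"],
    "object_2": ["architecture", "building", "dinosaur"],
}
names = list(posts)
rows = [[1.0 if label in posts[n] else 0.0 for label in vocab] for n in names]

label_scores, post_scores = first_component(rows)
loadings = dict(zip(vocab, label_scores))
scores = dict(zip(names, post_scores))
```

Which side ends up positive is arbitrary. Only the contrast matters: here ‘face’ and ‘smile’ share one sign while ‘dinosaur’ and ‘building’ share the other.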

Plotting these features

We can play this 15 questions game for any individual photo, but viewing all of these principal components (questions) together for every image on a 2-dimensional computer screen is a challenge. Instead, we can look at pairs of these components. Below, I plot the ‘Selfies & people vs. Objects’ description from left (more selfies & people) to right (more objects) and then the ‘Natural vs. Man-made’ description from bottom (more natural) to top (more man-made).

In this chart, each point represents one Instagram post, colored by the clusters from the analysis above. Data points (photos) away from the center of the plot can be described by one or both of these principal components, while points in the center of the plot are not cleanly described by these components.

We can see some clear relationships from this chart: posts labeled as more selfies & people are slightly more likely to also be described as more man-made, though most are close to the horizontal black line, suggesting a weak relationship. These images, as expected, are almost exclusively in the purple ‘People’ cluster.

On the upper right of the chart, we can see that posts that belong to the ‘Architecture’, ‘Outdoors’, and ‘Torosaurus statue’ clusters are more about objects and are more man-made. The ‘Birds’ cluster extends toward the bottom in the objects and natural quadrant.

The green data points, belonging to the ‘Exhibits & other’ cluster, stand out for being a large mass in the center of the plot. For the most part, these points are not strongly ‘Selfies & people vs. Objects’ or ‘Natural vs. Man-made’. As you “rotate” the plot over different pairs of principal components, the majority of these points remain close to the center of the plot. These images tend to have a lot of detail or components (people, exhibits, events, close-up specimens), so the Cloud Vision labels don’t always capture the full detail.

Below, I rotate the chart and plot the ‘Minerals vs. Birds & flowers’ component against the same ‘Natural vs. Man-made’ component. Here I remove the green ‘Exhibits & other’ cluster so that patterns in the better-defined clusters are easier to see.

In this view of the data, the purple ‘People’ cluster shrinks, as it is not well-described by the ‘Minerals vs. Birds & flowers’ principal component. Instead, the blue ‘Mineral hall’ cluster extends to the left and the ‘Birds’ and ‘Flowers’ clusters in brown and green extend right. Images of the ‘Great Hall’ are also partially described by the minerals/natural quadrant.


The museum can use this data to begin to answer questions about how guests share their museum visits with others on social media. For example, when visitors take portraits or selfies, where do they take them?

For this, we can go back to the first principal component, ‘Selfies & people vs. Objects’, and look at the most common non-human image labels associated with those photos. These include ‘fun’ (which tends to include museum events), ‘recreation’ in the Great Hall, ‘mineral’, and ‘snout’ & ‘mammal’ (guests really like the bears on the 2nd floor and the occasional photo of their dog).
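A simplified version of this lookup is sketched below in Python. Instead of the principal component score, it uses a stand-in: a post counts as a people photo if it carries any label from a hand-picked list of human labels. That list, the toy posts, and the function name are all assumptions made for illustration.

```python
from collections import Counter

# Hypothetical stand-in for the 'Selfies & people' side of the first component.
HUMAN_LABELS = {"face", "smile", "glasses", "selfie"}


def companion_labels(image_labels, human_labels=HUMAN_LABELS):
    """Count non-human labels that appear alongside human labels."""
    counts = Counter()
    for labels in image_labels.values():
        label_set = set(labels)
        if label_set & human_labels:  # keep only posts that show people
            counts.update(label_set - human_labels)
    return counts.most_common()


# Invented posts for illustration.
image_labels = {
    "post_a": ["face", "smile", "fun"],
    "post_b": ["face", "mineral"],
    "post_c": ["dinosaur", "tree"],
    "post_d": ["smile", "fun", "snout"],
}
companions = companion_labels(image_labels)
```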

Further exploration

Age of Reptiles

Google Cloud Vision has a second feature that searches for trademarked logos and copyrighted images within the photos. This feature is able to identify famous works of art inside an image, including the Peabody’s Age of Reptiles mural. The AI is able to positively identify the Age of Reptiles mural in 77 Instagram posts.

There are many potentially valuable uses of this information. For now, we can use this data to help validate our clustering algorithm. We’d expect all images that contain the Age of Reptiles to be grouped in the ‘Great Hall’ cluster and while most are, the mural also appears in a few other clusters:

Cluster              Images
B&W photography      1
Exhibits & other     18
Great Hall           52
Outdoors             2
People               3
Torosaurus Statue    1

Visual inspection of these clusters shows that most of these images were taken in the Great Hall, even if they were allocated into other clusters. In some cases, the prevalence of plants and trees in the mural resulted in the image being allocated to the ‘Outdoors’ cluster. In another case, the trees in the mural along with the ceratopsian skulls resulted in the image being allocated to the ‘Torosaurus statue’ cluster.
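As a quick worked check on the table above, the short Python snippet below uses the counts exactly as reported and computes the share of mural images that landed in the expected cluster. Only the variable names are mine.

```python
# Counts of Age of Reptiles images per cluster, copied from the table above.
mural_by_cluster = {
    "B&W photography": 1,
    "Exhibits & other": 18,
    "Great Hall": 52,
    "Outdoors": 2,
    "People": 3,
    "Torosaurus Statue": 1,
}

total = sum(mural_by_cluster.values())
great_hall_share = mural_by_cluster["Great Hall"] / total
summary = f"{mural_by_cluster['Great Hall']} of {total} mural images ({great_hall_share:.0%})"
```

That works out to 52 of 77 images, or roughly two thirds, in the ‘Great Hall’ cluster.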


At the end of 2016, the Peabody Museum introduced its first official hashtag for visitors to use when posting photos of themselves with the Torosaurus statue: #peabodyselfiesaurus. In this analysis, 103 Instagram posts use this hashtag, though it has been used 194 total times if we count the Instagram posts without the Peabody Museum location tag. Was this a successful marketing tactic? What are the image labels, clusters, and principal components associated with this hashtag?

Images posted with the #peabodyselfiesaurus hashtag are primarily clustered in ‘People’, ‘Torosaurus statue’, and ‘Exhibits & other’, depending on whether the human or the statue is more prominently featured in the image. For the principal components, most uses of the hashtag are located in the bottom left selfies & people and outside quadrant, though images in the ‘Torosaurus statue’ cluster are concentrated on the right-hand objects side of the chart.

There are a handful of points in the upper left man-made quadrant of the chart. These are instances when the visitors used the #peabodyselfiesaurus hashtag for photos with other Peabody exhibits [8].
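Finding the posts that use a given hashtag is a small text-matching job. Here is a minimal Python sketch, with invented captions and hypothetical function names, that pulls hashtags out of each caption and matches whole tags only, so #selfiesaurus is not mistaken for #peabodyselfiesaurus.

```python
import re


def hashtags(caption):
    """Return the lowercased hashtags in a caption, without the leading '#'."""
    return re.findall(r"#(\w+)", caption.lower())


def posts_with_tag(captions, tag):
    """Return the names of posts whose caption contains the exact hashtag."""
    return sorted(name for name, text in captions.items() if tag in hashtags(text))


# Invented captions for illustration.
captions = {
    "post_a": "Meeting the Torosaurus! #PeabodySelfiesaurus #yale",
    "post_b": "Minerals everywhere #crystals",
    "post_c": "Family day out #peabodyselfiesaurus",
    "post_d": "Close enough? #selfiesaurus",
}
official = posts_with_tag(captions, "peabodyselfiesaurus")
variant = posts_with_tag(captions, "selfiesaurus")
```

The matching post names could then be looked up against the cluster assignments or principal component scores.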

Future Analysis

The analyses of the Age of Reptiles mural and the #peabodyselfiesaurus hashtag are just two simple examples of using the Cloud Vision AI, clustering, and principal component analysis tools to understand how visitors use social media to share their visits to the museum. This data, along with the other information in each Instagram post (date and time, caption, and hashtags), can be used to get a more complete understanding of visitor engagement. Future analyses can look into:

  • Differences in guest engagement during special events.
    • 12 guests posted 42 photos about the Bones & Beer event in May 2018
    • 9 posts about Fiesta Latina total in 2017 and 2018
    • 23 visitors posted about MLK Day in 2018 and 17 posted in 2017
  • Does engagement increase when new exhibits open?
  • Visitor engagement on weekdays vs. weekends, or on free-admission vs. paid-admission days.
  • Text and sentiment analysis of photo captions. Can Instagram posts be considered a form of review for a trip?
  • The relationship between a post’s human-generated labels (hashtags & captions) and its AI-generated labels.
  • A more complete cross-examination of PCA with interactive charts to “rotate” the components.
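As a starting point for the weekday vs. weekend idea in the list above, the post timestamps could be bucketed as in the Python sketch below. The timestamps are invented, and the ISO format is an assumption about how the dates are stored.

```python
from collections import Counter
from datetime import datetime


def weekday_weekend_counts(timestamps):
    """Count posts made on weekdays vs. weekends from ISO-format timestamps."""
    counts = Counter()
    for stamp in timestamps:
        day = datetime.fromisoformat(stamp).weekday()  # Monday is 0, Sunday is 6
        counts["weekend" if day >= 5 else "weekday"] += 1
    return dict(counts)


# Hypothetical post times: a Saturday, a Sunday, and two Mondays.
timestamps = [
    "2018-05-12T14:30:00",
    "2018-05-13T10:00:00",
    "2018-05-14T16:45:00",
    "2018-01-15T11:20:00",
]
day_counts = weekday_weekend_counts(timestamps)
```

Raw counts like these would still need to be compared against attendance, since weekends likely draw more visitors to begin with.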

Learn more about the Peabody Museum. Also check out my series of posts on the data behind the Age of Reptiles mural.

  1. I actually have no clue if the Peabody Museum had a gift shop in 1926.

  2. The Stegosaurus skeleton in the center of the Great Hall has had the incorrect number of tail spikes since 1926. 🤷

  3. There are many routes to extracting data from an image. One option is to study the dominant colors in the images, but this might not help us understand visitor engagement.

  4. My guess is that this reduces the sample size of the analysis, but likely does not reflect differences in motivation or photo content.

  5. 58 Instagram photos are not assigned a cluster because Google Cloud Vision was unable to identify any labels for them. Some of these images are relevant to this analysis, but the AI is not strong enough to identify the contents. One example is my photo of insects trapped in amber, which ideally would be labeled as mineral and insects.

  6. This method of principal component analysis is heavily inspired by Julia Silge’s awesome work using PCA to understand Stack Overflow data.

  7. A very over-simplified explanation.

  8. Overall, this #peabodyselfiesaurus analysis suggests that there is visitor interest in engaging with museum-sanctioned social media marketing, and this is something the museum may want to invest more resources into going forward. This specific hashtag, however, might have been too complicated or verbose. Eight visitors, for whatever reason, use the hashtag #selfiesaurus instead.

Ryan Timpe
Data Science | Economics