Reading the vocal atlas: a beginner's guide to UMAP for bird sounds

Where the dots come from

Start with the audio. Each recording in the corpus is passed through an audio foundation model (for CrowLingo's v1 atlas, that's NatureLM-audio^[1] from Earth Species Project), which produces a vector of numbers describing the recording's acoustic features. The vector is high-dimensional: 1,024 numbers for many models, 1,536 for others, somewhere in that range. Each number captures some aspect of how the recording sounds: not in a way humans can read directly, but in a way the model learned during its pretraining on millions of unlabeled audio clips. Two recordings that sound similar to the model will have vectors that are mathematically close together. Two that sound different will be far apart.
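To make "mathematically close together" concrete, here is a minimal sketch that compares toy vectors with cosine similarity. The piece doesn't say which closeness measure the atlas uses, so cosine similarity is an assumption here, and the four-number vectors are made-up stand-ins for real 1,024-dimensional embeddings.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical 4-dimensional stand-ins for real 1,024-dimensional embeddings.
caw_a = [0.9, 0.1, 0.0, 0.2]
caw_b = [0.8, 0.2, 0.1, 0.2]
begging = [0.1, 0.9, 0.7, 0.0]

sim_caws = cosine_similarity(caw_a, caw_b)     # two similar-sounding recordings
sim_mixed = cosine_similarity(caw_a, begging)  # two different-sounding recordings

print(f"caw vs caw:     {sim_caws:.2f}")
print(f"caw vs begging: {sim_mixed:.2f}")
```

The two caw vectors score close to 1, while the caw and the begging call score much lower. That gap is what "close" and "far" mean before any plotting happens.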

Why we need to project

1,024 dimensions are unviewable. There's no 1,024-dimensional graph paper. We can compute distances and similarities in the high-dimensional space, but we can't draw them. UMAP (uniform manifold approximation and projection) is one of several techniques for taking high-dimensional vectors and projecting them onto a 2D plane in a way that preserves as much of the original geometric structure as possible. Specifically, UMAP preserves local neighborhood structure: vectors that were close in the original 1,024-dim space tend to stay close in the 2D scatter, even if some global structure gets distorted in the process.
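One way to see what "preserves local neighborhood structure" means is to check whether each point keeps the same nearest neighbors before and after projection. The sketch below does that on toy data. It does not run UMAP itself: the 2D coordinates are hand-written stand-ins for a projection, and the function names and values are illustrative assumptions.

```python
import math

def nearest_neighbors(points, index, k):
    """Indices of the k points closest to points[index] (Euclidean), excluding itself."""
    others = [i for i in range(len(points)) if i != index]
    others.sort(key=lambda i: math.dist(points[index], points[i]))
    return set(others[:k])

def neighborhood_overlap(high_dim, low_dim, k):
    """Average fraction of each point's k nearest neighbors shared by both spaces."""
    total = 0.0
    for i in range(len(high_dim)):
        shared = nearest_neighbors(high_dim, i, k) & nearest_neighbors(low_dim, i, k)
        total += len(shared) / k
    return total / len(high_dim)

# Toy data: two tight groups in 5 dimensions (hypothetical values).
high_dim = [
    [1.0, 0.0, 0.0, 0.1, 0.0],
    [0.9, 0.1, 0.0, 0.0, 0.1],
    [1.0, 0.1, 0.1, 0.0, 0.0],
    [0.0, 1.0, 0.9, 0.0, 0.1],
    [0.1, 0.9, 1.0, 0.1, 0.0],
    [0.0, 1.0, 1.0, 0.0, 0.0],
]
# Hand-written 2D layout standing in for a projection; not real UMAP output.
low_dim = [[0.0, 0.0], [0.2, 0.1], [0.1, 0.3], [5.0, 5.0], [5.2, 5.1], [5.1, 4.8]]

score = neighborhood_overlap(high_dim, low_dim, k=2)
print(f"neighborhood overlap: {score:.2f}")
```

A score near 1 means the 2D picture keeps the same local neighbors as the original space. On real data the score is usually below 1, which is the distortion the paragraph above warns about.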

What the axes mean (and don't)

On a typical UMAP scatter, including the CrowLingo atlas, the horizontal and vertical axes don't represent any specific named acoustic property. They're not 'pitch' on one axis and 'duration' on the other. They're abstract coordinates that UMAP chose to position the dots in a way that preserves their high-dimensional relationships. If you rotate the plot ninety degrees, nothing meaningful changes. If you flip it horizontally, nothing meaningful changes. The information is in the relationships between dots, not in the absolute positions. This trips up first-time UMAP readers consistently. The axes are not labels of acoustic properties; they're emergent geometry.
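The rotate-and-flip claim is easy to verify numerically. This small sketch uses made-up 2D coordinates, rotates them ninety degrees, flips them horizontally, and confirms that every dot-to-dot distance is unchanged.

```python
import math

def rotate(points, degrees):
    """Rotate 2D points counterclockwise about the origin."""
    theta = math.radians(degrees)
    c, s = math.cos(theta), math.sin(theta)
    return [(x * c - y * s, x * s + y * c) for x, y in points]

def flip_horizontal(points):
    """Mirror 2D points across the vertical axis."""
    return [(-x, y) for x, y in points]

def pairwise_distances(points):
    """Distances between every pair of points, in a fixed order."""
    return [math.dist(p, q) for i, p in enumerate(points) for q in points[i + 1:]]

# Hypothetical 2D scatter coordinates.
layout = [(0.0, 0.0), (1.0, 2.0), (3.0, -1.0), (-2.0, 4.0)]

original = pairwise_distances(layout)
rotated = pairwise_distances(rotate(layout, 90))
flipped = pairwise_distances(flip_horizontal(layout))

same_after_rotation = all(math.isclose(a, b, abs_tol=1e-9) for a, b in zip(original, rotated))
same_after_flip = all(math.isclose(a, b, abs_tol=1e-9) for a, b in zip(original, flipped))
print(same_after_rotation, same_after_flip)
```

Every coordinate changes, but no relationship between dots does. That is why reading meaning into "left" or "up" on the atlas is a mistake.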

What the clusters mean

Dense regions in the 2D scatter — places where many dots cluster together with relatively empty space around them — correspond to acoustically similar groups of recordings. For crow vocalizations, those clusters end up corresponding mostly to call types: territorial caws cluster together, mobbing calls cluster together, juvenile begging clusters together, and so on. The cluster names you see on a labeled atlas are assigned by humans listening to exemplars from each dense region and matching against the descriptive literature's vocabulary. The model doesn't know the names; it knows the geometry. The naming is an interpretive layer on top.
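As a rough illustration of "dense regions with empty space around them", the sketch below groups 2D points by chaining together any that sit within a fixed radius of each other. This is a deliberately simple stand-in, not the method the atlas uses (the piece doesn't specify one), and the coordinates and radius are invented. The output is numbered groups only; attaching names like "begging" still takes a human listening to exemplars.

```python
import math

def group_by_proximity(points, radius):
    """Label points so that any two within `radius` of each other share a group."""
    labels = [None] * len(points)
    next_label = 0
    for start in range(len(points)):
        if labels[start] is not None:
            continue  # already assigned to an earlier group
        labels[start] = next_label
        stack = [start]
        while stack:
            current = stack.pop()
            for other in range(len(points)):
                close = math.dist(points[current], points[other]) <= radius
                if labels[other] is None and close:
                    labels[other] = next_label
                    stack.append(other)
        next_label += 1
    return labels

# Hypothetical scatter: two dense patches separated by empty space.
scatter = [(0.0, 0.0), (0.3, 0.1), (0.1, 0.4), (6.0, 6.0), (6.2, 5.9), (5.8, 6.3)]
labels = group_by_proximity(scatter, radius=1.0)
print(labels)
```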

What it means when two dots are close

Two dots close together means: the model thinks these two recordings sound similar in ways that matter for whatever the model was trained to be sensitive to. For an audio foundation model trained on millions of wildlife clips, 'similar in ways that matter' generally means similar species, similar call type, similar acoustic features. Two close dots in the begging cluster are probably both juvenile begging calls. Two close dots in the rattle cluster are probably both rattle vocalizations from similar contexts. The exact reasons can be inspected by examining which acoustic features the recordings share; the headline interpretation is just 'these sound alike to the model.'

What it means when two dots are far

Far means dissimilar in the model's representation. A dot in the begging cluster is far from a dot in the territorial cluster because the model has learned that begging calls and territorial calls differ along many acoustic dimensions: pitch contour, harmonic structure, duration, temporal pattern. The distance isn't a single-number summary; it's a compression of multiple dimensions of difference into a 2D layout. Two recordings could be far apart in the scatter for several reasons that the visualization doesn't distinguish. The plot tells you they differ; for why they differ, you go back to the full embedding and compare specific feature dimensions.
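Going back to the full vectors can be as simple as ranking dimensions by how much two recordings disagree on them. Here is a minimal sketch with invented five-number vectors. Keep in mind that individual embedding dimensions generally aren't human-readable, so this shows where two vectors differ, not what the difference sounds like.

```python
def largest_differences(vec_a, vec_b, top_n):
    """Return (dimension index, absolute difference) pairs, largest first."""
    diffs = [(i, abs(a - b)) for i, (a, b) in enumerate(zip(vec_a, vec_b))]
    diffs.sort(key=lambda pair: pair[1], reverse=True)
    return diffs[:top_n]

# Hypothetical 5-dimensional stand-ins for two full embedding vectors.
begging = [0.10, 0.90, 0.70, 0.05, 0.30]
territorial = [0.85, 0.20, 0.65, 0.10, 0.30]

top = largest_differences(begging, territorial, top_n=2)
for index, gap in top:
    print(f"dimension {index}: differs by {gap:.2f}")
```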

How to use the atlas as a reader

Most useful question: what does this cluster contain? Click any dot in a dense region and the detail panel surfaces a recording, its metadata, its behavioral-context probabilities. The clusters cohere acoustically; their named identities (territorial, mobbing, etc.) are reasonable approximations of what's in them. Less useful question: what does this specific position mean? The exact (x, y) coordinates aren't meaningful as labels; they're meaningful as relational anchors. Useful question: are these two recordings acoustically similar? Eyeball the distance; if they're close, yes; if they're far, no. Cluster-level reading is easier than position-level reading. The atlas is for inspection, not measurement.

Quick answers from this piece.

What do the axes on a UMAP plot mean?

Nothing specific. UMAP axes are abstract coordinates that the algorithm chose to position dots in a way that preserves their high-dimensional relationships. They don't represent named acoustic properties like pitch or duration. The information is in the relationships between dots, not in absolute positions.

Why are there clusters in the vocal atlas?

Dense regions of the 2D scatter correspond to acoustically similar groups of recordings. For crows, those clusters mostly correspond to call types: territorial, mobbing, assembly, rattle, begging, etc. The cluster names are assigned by humans matching exemplars against the descriptive literature; the model knows the geometry, humans assign the names.

Is the UMAP scatter the same as the full embedding?

No. UMAP is a 2D projection of a 1,024-or-1,536-dim embedding space. It preserves local neighborhood structure but loses some global geometry. For visualization, the projection is useful; for similarity search and nearest-neighbor queries, work with the full high-dimensional vectors.
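For a sense of what a nearest-neighbor query over the full vectors looks like, here is a brute-force sketch. The recording ids, vectors, and the choice of Euclidean distance are all illustrative assumptions; a real corpus would likely use an indexed search rather than sorting everything.

```python
import math

def nearest(query, corpus, k):
    """Return the k recording ids whose vectors are closest to `query` (Euclidean)."""
    ranked = sorted(corpus, key=lambda rec_id: math.dist(query, corpus[rec_id]))
    return ranked[:k]

# Hypothetical recording ids mapped to toy 4-dimensional embeddings.
corpus = {
    "rec_001": [0.9, 0.1, 0.0, 0.2],
    "rec_002": [0.8, 0.2, 0.1, 0.2],
    "rec_003": [0.1, 0.9, 0.7, 0.0],
    "rec_004": [0.2, 0.8, 0.8, 0.1],
}
query = [0.88, 0.1, 0.0, 0.2]

matches = nearest(query, corpus, k=2)
print(matches)
```

The query runs against the original vectors, so it isn't affected by whatever the 2D projection distorted.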
