Where the dots come from

Start with the audio. Each recording in the corpus gets passed through an audio foundation model (for CrowLingo's v1 atlas, that's NatureLM-audio[1] from Earth Species Project), which produces a vector of numbers describing the recording's acoustic features. The vector is high-dimensional: 1,024 numbers for many models, 1,536 for others, somewhere in that range. Each number captures some aspect of how the recording sounds: not in a way humans can read directly, but in a way the model learned during its pretraining on millions of unlabeled audio clips. Two recordings that sound similar to the model will have vectors that are mathematically close together. Two that sound different will be far apart.
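
To make "mathematically close" concrete, here is a minimal sketch that compares vectors with cosine similarity. The choice of cosine similarity is an assumption (the atlas may use a different measure), and the vectors are random stand-ins, not real model output.

```python
import math
import random

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

rng = random.Random(0)
DIM = 1024  # illustrative embedding size

# Synthetic stand-ins for model output; not real NatureLM-audio embeddings.
caw_a = [rng.gauss(0, 1) for _ in range(DIM)]
caw_b = [x + rng.gauss(0, 0.1) for x in caw_a]  # small perturbation: "sounds similar"
rattle = [rng.gauss(0, 1) for _ in range(DIM)]  # independent draw: "sounds different"

sim_close = cosine_similarity(caw_a, caw_b)
sim_far = cosine_similarity(caw_a, rattle)
print(f"similar pair: {sim_close:.3f}, different pair: {sim_far:.3f}")
```

The perturbed copy scores near 1, while the independent vector scores near 0. That gap is the raw signal everything else in the atlas is built on.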

Why we need to project

1,024 dimensions are unviewable. There's no 1,024-dimensional graph paper. We can compute distances and similarities in the high-dimensional space, but we can't draw them. UMAP (uniform manifold approximation and projection) is one of several techniques for taking high-dimensional vectors and projecting them onto a 2D plane in a way that preserves as much of the original geometric structure as possible. Specifically, UMAP preserves local neighborhood structure: vectors that were close in the original 1,024-dim space stay close in the 2D scatter, even if some global structure gets distorted in the process.
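
One way to see what "preserves local neighborhood structure" means is to measure it. The sketch below is not UMAP itself. It is a hypothetical check that asks, for each point, how many of its nearest neighbors in the original space are still its nearest neighbors in a 2D layout. The function names, the toy data, and the choice of k are all assumptions for illustration.

```python
import math
import random

def knn(points, i, k):
    """Indices of the k points nearest to points[i] (Euclidean), excluding i itself."""
    others = [j for j in range(len(points)) if j != i]
    others.sort(key=lambda j: math.dist(points[i], points[j]))
    return set(others[:k])

def neighbor_overlap(high, low, k=5):
    """Mean fraction of each point's k nearest neighbors shared by both spaces."""
    total = 0.0
    for i in range(len(high)):
        total += len(knn(high, i, k) & knn(low, i, k)) / k
    return total / len(high)

rng = random.Random(0)
# Toy data: 60 points that really live on a 2D plane, padded with zeros to 16 dims.
layout_2d = [(rng.uniform(0, 10), rng.uniform(0, 10)) for _ in range(60)]
high_dim = [list(p) + [0.0] * 14 for p in layout_2d]

# Compare a layout that keeps every neighborhood with one whose positions are shuffled.
shuffled = layout_2d[:]
rng.shuffle(shuffled)

faithful_score = neighbor_overlap(high_dim, layout_2d)
shuffled_score = neighbor_overlap(high_dim, shuffled)
print(f"faithful layout: {faithful_score:.2f}, shuffled layout: {shuffled_score:.2f}")
```

A projection of real embeddings would land somewhere between these two extremes. The closer the score is to 1, the more the 2D neighborhoods can be trusted.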

What the axes mean (and don't)

On a typical scatter — including the CrowLingo atlas — the horizontal and vertical axes don't represent any specific named acoustic property. They're not 'pitch' on one axis and 'duration' on the other. They're abstract coordinates that UMAP chose to position the dots in a way that preserves their high-dimensional relationships. If you rotate the plot ninety degrees, nothing meaningful changes. If you flip it horizontally, nothing meaningful changes. The information is in the relationships between dots, not in the absolute positions. This trips up first-time UMAP readers consistently. The axes are not labels of acoustic properties; they're emergent geometry.
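
The rotation and flip claims are easy to verify numerically. This is a small sketch on made-up 2D coordinates: it rotates and mirrors a set of dots and confirms that every dot-to-dot distance is unchanged.

```python
import math
import random

def pairwise_distances(points):
    """All pairwise Euclidean distances, in a fixed (i < j) order."""
    n = len(points)
    return [math.dist(points[i], points[j]) for i in range(n) for j in range(i + 1, n)]

rng = random.Random(0)
# Made-up scatter coordinates standing in for a real atlas layout.
dots = [(rng.uniform(-5, 5), rng.uniform(-5, 5)) for _ in range(20)]

rotated = [(-y, x) for x, y in dots]  # rotate 90 degrees counterclockwise
flipped = [(-x, y) for x, y in dots]  # mirror horizontally

original = pairwise_distances(dots)
same_after_rotation = all(
    math.isclose(a, b) for a, b in zip(original, pairwise_distances(rotated))
)
same_after_flip = all(
    math.isclose(a, b) for a, b in zip(original, pairwise_distances(flipped))
)
print(same_after_rotation, same_after_flip)
```

Every coordinate changes, yet the relationships between dots do not. That is why the axes carry no labels.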

What the clusters mean

Dense regions in the 2D scatter — places where many dots cluster together with relatively empty space around them — correspond to acoustically similar groups of recordings. For crow vocalizations, those clusters end up corresponding mostly to call types: territorial caws cluster together, mobbing calls cluster together, juvenile begging clusters together, and so on. The cluster names you see on a labeled atlas are assigned by humans listening to exemplars from each dense region and matching against the descriptive literature's vocabulary. The model doesn't know the names; it knows the geometry. The naming is an interpretive layer on top.
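
The naming step can be sketched as a two-stage process: pick a few representative recordings per cluster, then have a person listen and attach a name. The code below is one hypothetical way to pick exemplars (the points nearest each cluster's centroid). The cluster ids, coordinates, and names are invented for illustration, and the atlas may select exemplars differently.

```python
import math
from collections import defaultdict

def pick_exemplars(points, labels, per_cluster=2):
    """For each cluster label, return indices of the points nearest that cluster's centroid."""
    members = defaultdict(list)
    for idx, label in enumerate(labels):
        members[label].append(idx)

    exemplars = {}
    for label, idxs in members.items():
        dim = len(points[idxs[0]])
        centroid = [sum(points[i][d] for i in idxs) / len(idxs) for d in range(dim)]
        ranked = sorted(idxs, key=lambda i: math.dist(points[i], centroid))
        exemplars[label] = ranked[:per_cluster]
    return exemplars

# Hypothetical 2D coordinates and cluster ids (ids are arbitrary integers, not names).
points = [(0.0, 0.0), (0.2, 0.1), (1.0, 1.0), (0.1, 0.3),
          (8.0, 8.0), (8.1, 7.9), (9.0, 9.0)]
labels = [0, 0, 0, 0, 1, 1, 1]

exemplars = pick_exemplars(points, labels)

# A human listens to the exemplars and attaches names; this mapping is made up here.
names = {0: "juvenile begging", 1: "territorial caw"}
for label, idxs in exemplars.items():
    print(names[label], "->", idxs)
```

Note that the code only ever handles integer ids and geometry. The `names` dictionary is the interpretive layer, and it comes from a person.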

What it means when two dots are close

Two dots close together means: the model thinks these two recordings sound similar in ways that matter for whatever the model was trained to be sensitive to. For an audio foundation model trained on millions of wildlife clips, 'similar in ways that matter' generally means similar species, similar call type, similar acoustic features. Two close dots in the begging cluster are probably both juvenile begging calls. Two close dots in the rattle cluster are probably both rattle vocalizations from similar contexts. The exact reasons can be inspected by examining which acoustic features the recordings share; the headline interpretation is just 'these sound alike to the model.'

What it means when two dots are far

Far means dissimilar in the model's representation. A dot in the begging cluster is far from a dot in the territorial cluster because the model has learned that begging calls and territorial calls differ along many acoustic dimensions: pitch contour, harmonic structure, duration, temporal pattern. The distance isn't a single-number summary; it's a compression of multiple dimensions of difference into a 2D layout. Two recordings could be far apart in the scatter for several reasons that the visualization doesn't distinguish. The plot tells you they differ; for why they differ, you go back to the full high-dimensional vectors and compare specific feature dimensions.
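
Going back to the vectors can be as simple as asking which dimensions differ most. The sketch below uses two hypothetical 8-number vectors in place of real embeddings. Keep in mind that individual embedding dimensions are not guaranteed to map onto a nameable property like pitch, so this shows where two recordings differ, not necessarily what the difference sounds like.

```python
def top_differing_dims(vec_a, vec_b, n=3):
    """Indices of the n dimensions with the largest absolute difference, largest first."""
    diffs = [abs(a - b) for a, b in zip(vec_a, vec_b)]
    return sorted(range(len(diffs)), key=lambda d: diffs[d], reverse=True)[:n]

# Hypothetical 8-dim vectors; real embeddings would have on the order of 1,024 dims.
begging = [0.1, 0.9, 0.3, 0.4, 0.5, 0.2, 0.7, 0.6]
territorial = [0.1, 0.1, 0.3, 0.9, 0.5, 0.2, 0.6, 0.6]

largest_gaps = top_differing_dims(begging, territorial)
print(largest_gaps)
```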

How to use the atlas as a reader

Most useful question: what does this cluster contain? Click any dot in a dense region and the detail panel surfaces a recording, its metadata, its behavioral-context probabilities. The clusters cohere acoustically; their named identities (territorial, mobbing, etc.) are reasonable approximations of what's in them. Less useful question: what does this specific position mean? The exact (x, y) coordinates aren't meaningful as labels; they're meaningful as relational anchors. Useful question: are these two recordings acoustically similar? Eyeball the distance; if they're close, yes; if they're far, no. Cluster-level reading is easier than position-level reading. The atlas is for inspection, not measurement.