CrowLingo

Latent space 101.

The minimum mental model you need to read a vocal atlas — and to argue with the people who built it.

A latent space is the high-dimensional geometry the model lives in. Every vocalization becomes a point. Distances between points correspond to acoustic similarity. Directions in the space correspond to dimensions of variation the model found important — sometimes interpretable, often not. The territorial caw cluster and the begging cluster sit far apart in this space because the model learned, without supervision, that those sounds differ. UMAP projects that geometry down to two dimensions for inspection, but the real arithmetic — similarity search, nearest neighbors, cluster discovery — always runs on the full vector. Treat the 2D atlas as a window, not a database.

AI interpretation, not translation.

An embedding is a list of numbers

An embedding is what a model produces when you give it an input. For audio: feed in a 30-second clip, get out a list of (commonly) 1,024 floating-point numbers. Each number is a coordinate. Each coordinate is a direction the model learned to care about.

You will never need to interpret those numbers individually. No one does. The interesting questions are geometric: how far is this embedding from that one? Which embeddings cluster? Which embeddings change predictably when the input changes?
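The "how far is this embedding from that one?" question can be made concrete with two standard measures, Euclidean distance and cosine similarity. The sketch below is illustrative only: the function names are hypothetical, and the three 4-number vectors are made-up stand-ins for real 1,024-number embeddings.

```python
import math


def euclidean_distance(a, b):
    """Straight-line distance between two vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of coordinates")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 1.0 means same direction."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of coordinates")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Made-up toy vectors (assumption): two similar calls and one different call.
caw_a = [0.9, 0.1, 0.0, 0.2]
caw_b = [0.8, 0.2, 0.1, 0.2]
begging = [0.0, 0.9, 0.7, 0.1]

print("caw_a vs caw_b   distance:", euclidean_distance(caw_a, caw_b))
print("caw_a vs begging distance:", euclidean_distance(caw_a, begging))
print("caw_a vs caw_b   cosine:  ", cosine_similarity(caw_a, caw_b))
print("caw_a vs begging cosine:  ", cosine_similarity(caw_a, begging))
```

With these toy values the two caws come out closer together and more aligned than the caw and the begging call, which is the pattern a well-organized space should show for real embeddings.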

The space they live in

The latent space is the 1,024-dimensional space these vectors live in. "1,024-dimensional" just means each vector has 1,024 coordinates. You can't picture it. That's fine: you can still compute distances and angles inside it, and those numbers are what matter.

The crucial property: the model arranges the space so that acoustically similar inputs land near each other. Different calls of the same type cluster. Different types separate. Graded transitions show up as bridges. Individual signatures show up as substructure within clusters.

Why we collapse to 2-D

Human eyes only read 2-D well. To see what's in the space, we project it down. UMAP (and similar tools such as t-SNE and PaCMAP) compresses 1,024 dimensions to 2 by preserving local neighborhoods as faithfully as the lower dimension allows.

The 2-D map is a summary, not a ground truth. Distances in the projection only loosely correspond to distances in the original space — local ones are usually right, global ones are often misleading. Use projections to inspect data; do real math (clustering, similarity) in the full space.
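One way to check how much a projection can be trusted is to compare each point's nearest neighbors in the full space with its nearest neighbors on the map. The sketch below is a minimal version of that check. The function names are hypothetical, and the toy "projection" simply drops the third coordinate of hand-made 3-D points. It is a stand-in for illustration, not UMAP output.

```python
import math


def k_nearest(points, index, k):
    """Indices of the k points closest to points[index], excluding itself."""
    query = points[index]
    others = [i for i in range(len(points)) if i != index]
    others.sort(key=lambda i: math.dist(query, points[i]))
    return others[:k]


def neighbor_overlap(full, projected, k):
    """Mean fraction of each point's k full-space neighbors kept on the map."""
    if len(full) != len(projected):
        raise ValueError("both inputs must describe the same points")
    total = 0.0
    for i in range(len(full)):
        kept = set(k_nearest(full, i, k)) & set(k_nearest(projected, i, k))
        total += len(kept) / k
    return total / len(full)


# Hand-made toy data (assumption): point 2 is far away along the third axis.
full = [
    [0.0, 0.00, 0.0],
    [0.1, 0.00, 0.0],
    [0.0, 0.05, 5.0],
    [5.0, 5.00, 0.0],
    [5.1, 5.00, 0.0],
]
# Stand-in projection: keep only the first two coordinates of each point.
projected = [point[:2] for point in full]

print("nearest neighbor of point 0, full space:", k_nearest(full, 0, 1))
print("nearest neighbor of point 0, on the map:", k_nearest(projected, 0, 1))
print("overlap at k=1:", neighbor_overlap(full, projected, 1))
```

In this toy case point 2 looks like point 0's closest neighbor on the map even though it is far away in the full space, so four of the five points keep their true nearest neighbor. The same comparison applies to a real projection: pass the full embeddings and the 2-D coordinates for the same clips.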

The parameter that changes everything

UMAP's most consequential knob is n_neighbors. Small values (5–10) emphasize fine local structure — you see tight clusters and lots of separation. Large values (50–100) smooth out into a more global view — you see continents instead of villages.

The same data with different n_neighbors produces visually different maps. There is no "right" setting; pick the one that surfaces the structure you're testing, and always disclose it. Interactive comparison across settings lands in v1.
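A sweep over n_neighbors can be sketched as a loop that projects the same embeddings once per setting and stores each map under the setting that produced it, so the value is always disclosed alongside the picture. This is a sketch under assumptions, not the atlas's actual pipeline: it assumes the third-party umap-learn package (its `umap.UMAP` class with the `n_neighbors`, `n_components` and `random_state` parameters), and the helper names, the default values and the fixed seed are hypothetical choices.

```python
def default_umap_factory(n_neighbors):
    """Build a 2-D UMAP reducer. Requires the third-party umap-learn package."""
    import umap  # imported lazily so the sweep logic itself has no dependency

    # random_state is fixed (arbitrary choice) so repeated runs are comparable.
    return umap.UMAP(n_neighbors=n_neighbors, n_components=2, random_state=42)


def sweep_n_neighbors(embeddings, values=(5, 15, 50),
                      make_reducer=default_umap_factory):
    """Project the same embeddings once per n_neighbors value.

    Returns a dict mapping each n_neighbors setting to the 2-D coordinates
    produced with it, so every map stays labeled with its setting.
    """
    maps = {}
    for n_neighbors in values:
        reducer = make_reducer(n_neighbors)
        maps[n_neighbors] = reducer.fit_transform(embeddings)
    return maps


# Example call (assumes `embeddings` holds one full-length vector per clip):
# maps = sweep_n_neighbors(embeddings, values=(5, 15, 50, 100))
# for setting, coords in maps.items():
#     print(setting, len(coords))
```

Passing the reducer in as a factory is a design convenience: the same loop works with t-SNE, PaCMAP or any other object that offers `fit_transform`, and it can be exercised with a stub when the projection library is not installed.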

Reading a vocal map

Treat dense regions as candidate call categories. Treat bridges as graded variation. Treat outliers as either encoding artifacts or genuine novelty — listen to a few before deciding which. The map is a hypothesis-generator, not a verdict.
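To build a short list of outliers worth listening to, one simple heuristic is to rank clips by how far away their k-th nearest neighbor is in the full space: isolated points score high. The sketch below is one possible approach, not the method used for the atlas. The function names are hypothetical, and the 2-number points are made-up stand-ins for full embeddings.

```python
import math


def kth_neighbor_distance(points, index, k):
    """Distance from points[index] to its k-th nearest other point."""
    distances = sorted(
        math.dist(points[index], other)
        for i, other in enumerate(points)
        if i != index
    )
    return distances[k - 1]


def outlier_shortlist(points, k, top):
    """Indices of the `top` points whose k-th nearest neighbor is farthest."""
    scores = [(kth_neighbor_distance(points, i, k), i) for i in range(len(points))]
    scores.sort(reverse=True)  # most isolated first
    return [index for _, index in scores[:top]]


# Made-up toy points (assumption): four clustered clips and one isolated clip.
points = [
    [0.0, 0.0],
    [0.1, 0.0],
    [0.0, 0.1],
    [0.1, 0.1],
    [3.0, 3.0],
]

print("listen to these first:", outlier_shortlist(points, k=2, top=1))
```

The score only says a clip is isolated. Whether it is an encoding artifact or a genuinely novel call still has to be decided by listening.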

Frequently asked

What people ask about this.

What is a latent space?
A latent space is the high-dimensional vector space a model lives in after it embeds an input. Every audio clip becomes a point; distances between points correspond to acoustic similarity; directions in the space correspond to dimensions of variation the model found important. UMAP projects that geometry down to 2D for inspection.
Why are nearest-neighbor queries done in high-dim, not 2D?
UMAP preserves local neighborhood structure for visualization but loses information in projection. Similarity search and nearest-neighbor queries always run on the full 1,024-or-1,536-dim embedding to keep the geometry honest.