Methods · Sub-page
Latent space 101.
The minimum mental model you need to read a vocal atlas — and to argue with the people who built it.
An embedding is a list of numbers
An embedding is what a model produces when you give it an input. For audio: feed in a 30-second clip, get out a list of (commonly) 1,024 floating-point numbers. Each number is a coordinate. Each coordinate is a direction the model learned to care about.
You will never need to interpret those numbers individually. No one does. The interesting questions are geometric: how far is this embedding from that one? Which embeddings cluster? Which embeddings change predictably when the input changes?
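Those geometric questions reduce to two basic measurements. A minimal sketch with NumPy, using random vectors as stand-ins for real model output (the 1,024 dimension matches the text; everything else is illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# Two stand-in 1,024-dimensional embeddings (real ones come from a model).
a = rng.normal(size=1024)
b = rng.normal(size=1024)

def cosine_similarity(u, v):
    """Angle between the vectors: 1.0 = same direction, 0.0 = orthogonal."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def euclidean_distance(u, v):
    """Straight-line distance between the two points."""
    return float(np.linalg.norm(u - v))

print(cosine_similarity(a, b))   # random high-dim vectors land near 0
print(euclidean_distance(a, b))
```

Nothing here interprets individual coordinates; both measurements treat the vector as a single point, which is the mental model the rest of this page relies on.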
The space they live in
The latent space is the 1,024-dimensional space these vectors live in. "1,024-dimensional" just means each vector has 1,024 coordinates. You can't picture it. That's fine — you can still compute distances and angles inside it, and those numbers are what matter.
The crucial property: the model arranges the space so that acoustically similar inputs land near each other. Different calls of the same type cluster. Different types separate. Graded transitions show up as bridges. Individual signatures show up as substructure within clusters.
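That clustering property can be checked directly with distances. A toy sketch: two synthetic "call types", each a tight cloud around its own center (all data here is fabricated for illustration; a real atlas would use model embeddings):

```python
import numpy as np

rng = np.random.default_rng(1)
dim = 1024

# Hypothetical setup: each call type is a tight cloud around its own center.
center_a = rng.normal(size=dim)
center_b = rng.normal(size=dim)
type_a = center_a + 0.1 * rng.normal(size=(20, dim))  # 20 calls of type A
type_b = center_b + 0.1 * rng.normal(size=(20, dim))  # 20 calls of type B

# From one type-A call: distance to its own kind vs. to the other type.
within = np.linalg.norm(type_a[1:] - type_a[0], axis=1).mean()
between = np.linalg.norm(type_b - type_a[0], axis=1).mean()
print(within, between)   # within-type distances are far smaller
```

When the model has done its job, real embeddings behave like this toy: within-cluster distances are small relative to between-cluster distances, and that gap is what clustering algorithms exploit.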
Why we collapse to 2-D
Human eyes only read 2-D well. To see what's in the space, we project it down. UMAP and similar tools (t-SNE, PaCMAP) compress 1,024 dimensions to 2 by preserving local neighborhoods as faithfully as the lower dimension allows.
The 2-D map is a summary, not a ground truth. Distances in the projection only loosely correspond to distances in the original space — local ones are usually right, global ones are often misleading. Use projections to inspect data; do real math (clustering, similarity) in the full space.
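The division of labor looks like this in practice: similarity math runs in the full space; the 2-D coordinates exist only for plotting. A sketch with NumPy, using a random 2-D projection as a crude stand-in for UMAP (it exaggerates the distortion, but the disagreement it exposes is the point):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 1024))   # stand-in full-space embeddings

# Real math happens here, in all 1,024 dimensions: a full pairwise
# distance matrix, computed from the Gram matrix for efficiency.
sq = (X ** 2).sum(axis=1)
d_full = np.sqrt(np.maximum(sq[:, None] + sq[None, :] - 2 * X @ X.T, 0))

# The 2-D view is only for looking. A random projection stands in for
# UMAP here, to show that projected distances need not agree.
X2 = X @ rng.normal(size=(1024, 2))
sq2 = (X2 ** 2).sum(axis=1)
d_proj = np.sqrt(np.maximum(sq2[:, None] + sq2[None, :] - 2 * X2 @ X2.T, 0))

# Point 0's nearest neighbour in each space — these often disagree.
nn_full = int(np.argsort(d_full[0])[1])
nn_proj = int(np.argsort(d_proj[0])[1])
print(nn_full, nn_proj)
```

UMAP distorts far less than a random projection, but the lesson transfers: any answer you care about (nearest neighbors, cluster membership, similarity scores) should come from `d_full`, not from the picture.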
The parameter that changes everything
UMAP's most consequential knob is n_neighbors. Small values (5–10) emphasize fine local structure — you see tight clusters and lots of separation. Large values (50–100) smooth out into a more global view — you see continents instead of villages.
The same data with different n_neighbors produces visually different maps. There is no "right" setting; pick the one that surfaces the structure you're testing, and always disclose it. Interactive comparison across settings lands in v1.
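Why neighborhood size has this effect can be seen without running UMAP at all. A conceptual stand-in: measure, for each point, what fraction of its k nearest neighbors share its cluster. Small k stays inside a cluster; large k is forced to reach across clusters, which is exactly the local-to-global shift n_neighbors produces. (Synthetic data; the helper name is made up for this sketch.)

```python
import numpy as np

rng = np.random.default_rng(3)
dim = 64

# Three well-separated synthetic "call type" clusters, 30 points each.
centers = 5 * rng.normal(size=(3, dim))
X = np.vstack([c + rng.normal(size=(30, dim)) for c in centers])
labels = np.repeat([0, 1, 2], 30)

def knn_label_purity(X, labels, k):
    """Fraction of each point's k nearest neighbours sharing its label."""
    sq = (X ** 2).sum(axis=1)
    d = sq[:, None] + sq[None, :] - 2 * X @ X.T
    np.fill_diagonal(d, np.inf)              # exclude self
    nn = np.argsort(d, axis=1)[:, :k]
    return float((labels[nn] == labels[:, None]).mean())

# k=5: neighbourhoods stay inside one cluster (the fine, local view).
# k=60: only 29 same-type neighbours exist, so neighbourhoods must span
# clusters (the smoothed, global view).
print(knn_label_purity(X, labels, 5))
print(knn_label_purity(X, labels, 60))
```

The same trade-off drives UMAP: n_neighbors fixes how many neighbors define each point's local patch, so it controls whether the map honors villages or continents.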
Reading a vocal map
Treat dense regions as candidate call categories. Treat bridges as graded variation. Treat outliers as either encoding artifacts or genuine novelty — listen to a few before deciding which. The map is a hypothesis-generator, not a verdict.
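One concrete way to surface those outliers for listening: score each point by its mean distance to its k nearest neighbors, in the full space, and queue the loneliest points for review. A sketch on fabricated data (the function name and the injected outliers are illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)
dim = 64

# Fabricated map: one dense cluster plus three far-flung points.
cluster = rng.normal(size=(50, dim))
outliers = rng.normal(size=(3, dim)) + 20.0
X = np.vstack([cluster, outliers])

def knn_distance(X, k=5):
    """Mean distance to each point's k nearest neighbours (larger = lonelier)."""
    sq = (X ** 2).sum(axis=1)
    d = np.sqrt(np.maximum(sq[:, None] + sq[None, :] - 2 * X @ X.T, 0))
    np.fill_diagonal(d, np.inf)
    return np.sort(d, axis=1)[:, :k].mean(axis=1)

scores = knn_distance(X)
# The highest-scoring points are candidates to *listen to* — the score
# does not distinguish encoding artifacts from genuine novelty.
candidates = np.argsort(scores)[-3:]
print(sorted(candidates.tolist()))   # → [50, 51, 52], the injected outliers
```

The score ranks; your ears decide. That division keeps the map in its proper role as a hypothesis-generator.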