FIG 7.1 — Glossary · Reference

Glossary.

Definitions of the technical vocabulary used across CrowLingo: bioacoustics, AI audio models, and corvid behavior. Each entry is a short definition. Each term's deep link leads to the methods page where it is explained at length.

Embedding: A learned vector representation of an input. For crow audio: a 1,024-number list a model produces for one clip; similar clips → nearby vectors.
Latent space: The high-dimensional space embeddings live in. Geometry inside it approximates acoustic similarity — close = similar, far = different.
UMAP: Uniform Manifold Approximation and Projection. A non-linear dimensionality reducer that flattens 1,024-dim embeddings to 2-D for visualization.
HDBSCAN: Hierarchical Density-Based Spatial Clustering of Applications with Noise. Finds dense clusters of arbitrary shape without requiring a target cluster count.
Self-supervised learning (SSL): Training paradigm where a model creates its own supervision signal from unlabeled data — e.g. by predicting masked parts of a spectrogram from the rest.
NatureLM-audio: Earth Species Project's audio-language foundation model for bioacoustics (ICLR 2025). Audio in, natural language out. Zero-shot species + behavioral classification.
BirdNET: Open-source deep-learning bird-sound classifier from Cornell Lab + Chemnitz UT. The workhorse for avian detection in long recordings.
Perch: Google's bioacoustics embedding model (EfficientNet-based). Stronger within-species detail than BirdNET; broader audio coverage than NatureLM-audio.
Syrinx: The bird vocal organ. Located at the bifurcation of the trachea into the bronchi. In songbirds it has two independently controlled sound sources, which is why some birds can make two notes at once.
Spectrogram: A time-frequency-intensity plot of an audio clip. Time on the x-axis, frequency on the y, color = energy at (t, f).
Graded calls: Calls that vary continuously along acoustic dimensions rather than belonging to discrete categories. Appear as bridges between clusters on the vocal map.
Dialect: Group-level acoustic variation in calls. Crow family groups show measurable centroid differences in their shared call types — early evidence for crow dialect.
Syntax (in animal communication): Combinatorial structure where call order carries meaning beyond the sum of parts. For crows: statistical regularities exist; behavioral confirmation is thin.
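
The Spectrogram entry above can be made concrete with a short-time Fourier transform. The following is a minimal sketch, assuming NumPy and a synthetic 1.5 kHz tone in place of a real crow clip; the function name `spectrogram` and the frame and hop sizes are illustrative choices, not CrowLingo's settings.

```python
import numpy as np

def spectrogram(signal, sample_rate, frame_len=512, hop=128):
    """Return (times, freqs, power_db) for a 1-D signal via a short-time FFT."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    # Slice the signal into overlapping, windowed frames.
    frames = np.stack([signal[i * hop : i * hop + frame_len] * window
                       for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    power_db = 10.0 * np.log10(power + 1e-12)   # small offset avoids log(0)
    freqs = np.fft.rfftfreq(frame_len, d=1.0 / sample_rate)
    times = (np.arange(n_frames) * hop + frame_len / 2) / sample_rate
    # Transpose so rows are frequency (y-axis) and columns are time (x-axis).
    return times, freqs, power_db.T

# Synthetic stand-in for a recording: one second of a 1.5 kHz tone at 16 kHz.
sample_rate = 16_000
t = np.arange(sample_rate) / sample_rate
tone = np.sin(2 * np.pi * 1500.0 * t)

times, freqs, power_db = spectrogram(tone, sample_rate)
peak_hz = freqs[power_db.mean(axis=1).argmax()]   # frequency bin with the most energy
```

Plotting `power_db` as an image, with `times` on the x-axis and `freqs` on the y-axis, gives the color-coded picture the entry describes.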

What people ask about this.

What is an audio embedding in bioacoustic AI?

An audio embedding is a learned vector representation of an input audio clip. For crow audio specifically, BirdNET produces a 1,024-number list per clip and Perch 2.0 produces a 1,536-number list. The embedding captures acoustic content (frequency profile, spectral structure, temporal pattern) that hand-engineered features tend to miss, because it is learned from large training corpora (labeled or unlabeled, depending on the model) rather than designed by hand. The geometry is what makes the representation useful: similar-sounding clips produce nearby vectors, dissimilar clips produce distant vectors, and downstream tasks like species classification, individual identification, or behavioral-context clustering operate on those distances. Modern bioacoustic AI is largely an embedding-based field: BirdNET, Perch, and NatureLM-audio all produce learned embeddings that downstream tools build on.
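
To illustrate how downstream tasks "operate on those distances", here is a minimal sketch of similarity ranking with cosine similarity. The clip names and the 4-number vectors are made up for illustration; real embeddings would come from a model and have over a thousand dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def nearest(query, library):
    """Return (clip_id, similarity) pairs, most similar first."""
    scored = [(clip_id, cosine_similarity(query, vec))
              for clip_id, vec in library.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Hypothetical toy vectors standing in for real model outputs.
library = {
    "caw_a":  [0.9, 0.1, 0.0, 0.2],
    "caw_b":  [0.8, 0.2, 0.1, 0.3],
    "rattle": [0.0, 0.9, 0.8, 0.1],
}
query = [0.85, 0.15, 0.05, 0.25]

ranked = nearest(query, library)   # the two caw clips rank above the rattle
```

Cosine similarity is one common choice of distance for embeddings; Euclidean distance on normalized vectors gives the same ranking.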

What is UMAP (Uniform Manifold Approximation and Projection) and why is it used for crow vocal atlases?

UMAP is a non-linear dimensionality-reduction algorithm that flattens high-dimensional embeddings (1,024-dim or 1,536-dim audio vectors) down to two or three dimensions for human-inspectable visualization while preserving local neighborhood structure: points that are close in the original high-dim space tend to stay close in the 2-D projection. UMAP has displaced earlier methods like t-SNE in much bioacoustic visualization work because it tends to retain more of the global structure and scales to larger datasets. CrowLingo's vocal atlas is a UMAP projection of NatureLM-audio embeddings; the cluster geometry visible in the 2-D map approximates the similarity geometry of the underlying high-dim representation, though distances between far-apart clusters on the map should be read loosely. Similarity search itself always runs on the full-dim embedding, never on the 2-D UMAP coordinates.

What is HDBSCAN clustering and how does it find vocal categories without specifying a target count?

HDBSCAN (Hierarchical Density-Based Spatial Clustering of Applications with Noise) is a density-based clustering algorithm that finds dense regions in high-dimensional space without requiring the user to specify the target cluster count in advance; the main setting is a minimum cluster size. The algorithm builds a hierarchy of density-connected points, keeps the clusters that remain stable across a range of density thresholds, and outputs cluster assignments plus a noise label for points that don't fit any cluster confidently. This is the property you want for animal vocal repertoires: you don't know in advance how many vocal categories a species has, so you let the data tell you. CrowLingo's atlas uses HDBSCAN on the full-dim embedding to discover the nine clusters of American crow vocalizations. The cluster boundaries are data-driven; the cluster labels (territorial, mobbing, etc.) are assigned afterwards by matching against the prior descriptive literature.

What is self-supervised learning in audio and why does it matter for bioacoustics?

Self-supervised learning is a training paradigm where a model creates its own supervision signal from unlabeled data. For audio, that typically means predicting masked or hidden portions of a spectrogram from the surrounding context, or learning that two clips from the same source should have similar representations while clips from different sources differ. The model never sees an explicit human label during pretraining; the supervision emerges from the structure of the data itself. This matters for bioacoustics because labeled wildlife audio is scarce and expensive, while unlabeled wildlife audio exists in enormous quantities (Macaulay Library has 1.3M+ recordings). Self-supervised pretraining on a large unlabeled corpus produces rich representations that downstream tasks can use with much less labeled fine-tuning data. Among the models in this glossary, NatureLM-audio builds on a self-supervised pretrained audio encoder, whereas BirdNET and Perch are trained mainly as supervised species classifiers; in both cases the reusable product is the learned embedding.
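
The masked-prediction objective can be sketched without any labels at all. The example below assumes NumPy and a smooth synthetic "spectrogram"; the interpolating predictor is a deliberately simple placeholder for the neural network a real system would train, and all function names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def mask_frames(spec, mask_ratio=0.3):
    """Hide a random subset of time frames; return the masked copy and the mask."""
    n_frames = spec.shape[1]
    n_masked = max(1, int(round(mask_ratio * n_frames)))
    mask = np.zeros(n_frames, dtype=bool)
    mask[rng.choice(n_frames, size=n_masked, replace=False)] = True
    corrupted = spec.copy()
    corrupted[:, mask] = 0.0
    return corrupted, mask

def masked_mse(prediction, target, mask):
    """Reconstruction loss, scored only on the hidden frames."""
    return float(np.mean((prediction[:, mask] - target[:, mask]) ** 2))

def interpolate_predictor(corrupted, mask):
    """Placeholder 'model': fill hidden frames from visible neighbours in time."""
    frames = np.arange(corrupted.shape[1])
    filled = corrupted.copy()
    for row in range(corrupted.shape[0]):
        filled[row, mask] = np.interp(frames[mask], frames[~mask],
                                      corrupted[row, ~mask])
    return filled

# Smooth synthetic "spectrogram": 32 frequency bins x 100 time frames.
spec = np.sin(np.linspace(0, 3, 100))[None, :] * np.linspace(1, 2, 32)[:, None]

corrupted, mask = mask_frames(spec)
loss_blank = masked_mse(corrupted, spec, mask)   # baseline: predict silence
loss_interp = masked_mse(interpolate_predictor(corrupted, mask), spec, mask)
```

The training target is the original clip itself: the hidden frames act as the "labels". A predictor that uses context scores a lower loss than one that ignores it, and that gap is the signal a trainable model would follow.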