Stage 1: Capture
Every vocal map starts with a microphone in front of a crow. Most public corpora (Wikimedia Commons, Xeno-canto, the Macaulay Library) were built with handheld parabolic mics aimed at calling birds, recorded at 44.1 or 48 kilohertz. That's enough bandwidth to cover the full crow vocal range, from the 200-hertz bottom of the loudest caws to the 8-kilohertz top of juvenile begging. The harder problem isn't sample rate; it's behavioral context. A recording without notes (where, when, what the bird was doing, whether it was alone or with kin) is acoustically usable but semantically thin. That's why the wearable-logger studies of the 2020s, Demartsev[4]'s carrion crow work especially, are quietly revolutionary. Failure mode: capturing audio without synchronized behavioral observation, then claiming behavioral interpretations downstream that the data can't support.
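One way to keep the context attached to the audio is to make it part of the record itself. The sketch below is a minimal illustration, not a schema used by any of the corpora named above; every field name and every sample value is hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Recording:
    """One field recording plus the notes taken alongside it.

    All field names are hypothetical, chosen for illustration only.
    """
    path: str
    sample_rate_hz: int
    location: Optional[str] = None
    recorded_at: Optional[str] = None      # ISO 8601 timestamp, if known
    behavior: Optional[str] = None         # what the bird was doing
    social_context: Optional[str] = None   # e.g. "alone" or "with kin"


def supports_behavioral_claims(rec: Recording) -> bool:
    """True only if every context field was filled in at capture time."""
    context = (rec.location, rec.recorded_at, rec.behavior, rec.social_context)
    return all(value is not None and value.strip() != "" for value in context)


# Made-up examples: one bare clip, one fully annotated clip.
bare = Recording(path="clip_001.wav", sample_rate_hz=48_000)
annotated = Recording(
    path="clip_002.wav",
    sample_rate_hz=48_000,
    location="city park",
    recorded_at="2024-05-01T06:30:00",
    behavior="mobbing",
    social_context="with kin",
)
print(supports_behavioral_claims(bare))       # False
print(supports_behavioral_claims(annotated))  # True
```

A check like this is only a gate: it says whether a clip is eligible for behavioral interpretation, not whether the notes are accurate.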
Stage 2: Detect
A field recording is 99 percent silence, wind, and other species. Detection narrows it down. The workhorse is BirdNET[1] (Stefan Kahl's 2021 model that runs on phones, smart speakers, and long-term monitoring deployments), which slides a window across the audio and emits 'crow likely' probabilities. False positives cluster around other corvid species and a handful of mimics; false negatives cluster around quiet or distant calls. For and the newer NatureLM-audio[2] family, detection becomes a feature of the same model that does the embedding: one forward pass, two outputs. The cost is opacity; the gain is speed. Failure mode: trusting detection-stage confidence scores as ground truth without spot-checking a sample by ear.
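The sliding-window idea and the spot-check habit can be sketched without any model at all. The code below assumes the detector's per-window scores already exist; the scores are made up, the function names are hypothetical, and nothing here calls BirdNET itself. The 3-second window matches the BirdNET window length mentioned in Stage 3; the 1-second hop and the 0.5 threshold are arbitrary choices for the example.

```python
import random


def window_starts(duration_s, window_s=3.0, hop_s=1.0):
    """Start times of fixed-length windows that fit inside the recording."""
    starts = []
    t = 0.0
    while t + window_s <= duration_s + 1e-9:
        starts.append(round(t, 6))
        t += hop_s
    return starts


def detect(scored_windows, threshold=0.5):
    """Keep (start_time, score) pairs whose score reaches the threshold."""
    return [(t, s) for t, s in scored_windows if s >= threshold]


def spot_check_sample(detections, k, seed=0):
    """Pick up to k detections at random for a human to audition by ear."""
    rng = random.Random(seed)
    return rng.sample(detections, min(k, len(detections)))


starts = window_starts(6.0)
print(starts)  # [0.0, 1.0, 2.0, 3.0]

fake_scores = [0.05, 0.92, 0.61, 0.10]  # made-up stand-ins for model output
hits = detect(list(zip(starts, fake_scores)))
print(hits)  # [(1.0, 0.92), (2.0, 0.61)]

to_review = spot_check_sample(hits, k=1)  # listen to these before trusting the rest
```

Sampling with a fixed seed keeps the review set reproducible, so two people auditing the same deployment hear the same clips.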
Stage 3: Preprocess
Once a clip is isolated, it needs to look like the training distribution of the model that will embed it. That means a bandpass filter to drop traffic and HVAC rumble below 150 hertz, a mild high-shelf cut above 10 kilohertz to suppress hiss, peak normalization to a target level so a loud caw and a quiet rattle land at comparable amplitudes, and chunking to the model's expected window length: five seconds for , three for BirdNET[1], ten for some BEANS evaluations. None of this is sophisticated. All of it determines whether your downstream clusters are real or artifacts of preprocessing variance. Failure mode: heavy denoising that removes the spectral grain SSL models actually want to read. If you can hear the call clearly, the model can too, so stop preprocessing.
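Two of those steps, normalization and chunking, are simple enough to show in plain Python. This is a sketch under stated assumptions: samples are floats in the range -1 to 1, the target peak of 0.9 is an arbitrary choice, and the final short chunk is zero-padded (some pipelines drop it instead). The filters are left out because they would normally come from a DSP library.

```python
def peak_normalize(samples, target_peak=0.9):
    """Scale samples so the largest absolute value equals target_peak."""
    peak = max((abs(x) for x in samples), default=0.0)
    if peak == 0.0:
        return list(samples)  # pure silence: nothing to scale
    gain = target_peak / peak
    return [x * gain for x in samples]


def chunk(samples, sample_rate_hz, window_s):
    """Split into fixed-length windows, zero-padding the final one."""
    size = int(round(sample_rate_hz * window_s))
    chunks = []
    for start in range(0, len(samples), size):
        piece = list(samples[start:start + size])
        piece.extend([0.0] * (size - len(piece)))  # pad the tail if short
        chunks.append(piece)
    return chunks


# Toy signal at a toy sample rate of 2 Hz, so the windows are easy to read.
quiet = [0.0, 0.1, -0.2, 0.05, 0.1]
normalized = peak_normalize(quiet, target_peak=0.8)
windows = chunk(normalized, sample_rate_hz=2, window_s=1.0)
print(len(windows))  # 3
```

With real audio the same call would look like `chunk(samples, 48_000, 3.0)` for a 3-second model window.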
Stage 4: Embed
This is the part that actually changed in 2023. Older bioacoustic pipelines hand-crafted features (mean frequency, peak amplitude, formant structure) and called that an acoustic profile. Foundation models don't. and NatureLM-audio[2] take the raw waveform, run it through a transformer trained on millions of unlabeled wildlife recordings, and emit a 1,024- or 1,536-dimensional vector. The geometry of that vector space carries far more information than any hand-engineered profile ever did. Two crow calls close together in the embedding space sound similar; calls in distant regions sound different in ways even human listeners might struggle to articulate. That geometry is the substrate for everything that follows. Failure mode: mixing embeddings from different encoders in the same or similarity search. Embedding spaces are not interoperable.
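The failure mode is easy to guard against in code if every vector carries a record of which encoder produced it. The sketch below is illustrative only: the `Embedding` class, the encoder labels, and the three-dimensional toy vectors are all assumptions, standing in for real 1,024- or 1,536-dimensional model output.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class Embedding:
    encoder: str          # hypothetical label for the model that produced it
    vector: List[float]


def cosine_similarity(a: Embedding, b: Embedding) -> float:
    """Cosine similarity, refusing to compare vectors from different encoders."""
    if a.encoder != b.encoder:
        raise ValueError(f"cannot compare {a.encoder!r} with {b.encoder!r}")
    if len(a.vector) != len(b.vector):
        raise ValueError("vectors have different dimensionality")
    dot = sum(x * y for x, y in zip(a.vector, b.vector))
    norm_a = math.sqrt(sum(x * x for x in a.vector))
    norm_b = math.sqrt(sum(y * y for y in b.vector))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Made-up toy vectors; real embeddings have a thousand or more dimensions.
caw_1 = Embedding("encoder_a", [1.0, 0.0, 0.0])
caw_2 = Embedding("encoder_a", [0.9, 0.1, 0.0])
rattle = Embedding("encoder_a", [0.0, 1.0, 0.0])
other_model = Embedding("encoder_b", [1.0, 0.0, 0.0])

print(cosine_similarity(caw_1, caw_2) > cosine_similarity(caw_1, rattle))  # True

try:
    cosine_similarity(caw_1, other_model)
except ValueError as err:
    print("refused:", err)
```

Note that `caw_1` and `other_model` hold identical numbers. The comparison is still refused, because matching coordinates from two different encoders mean nothing.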
Stage 5: Project
1,536 dimensions is unviewable. UMAP (uniform manifold approximation and projection) flattens them to two while preserving local neighborhood structure. The result is the map you see in the atlas: dots that were close in the full embedding space stay close in the 2D scatter, and dots that were far apart generally stay apart. UMAP isn't unique (t-SNE projects similarly, PaCMAP differently, PCA loses too much), but it's become the default because it preserves the relationships that matter to a human inspector. The caveat: the 2D positions are for inspection, not for similarity search. Nearest-neighbor queries always run on the full 1,536-dimensional vectors. Failure mode: treating the UMAP layout as a database; making decisions based on 2D distances rather than full-embedding cosine similarity.
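The difference between the two kinds of query is worth seeing side by side. In the sketch below, each clip stores a full vector and a 2D position. Both are made up: the vectors are four-dimensional stand-ins, and the 2D coordinates are hand-placed to exaggerate a distortion rather than produced by UMAP. The function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nearest_by_embedding(query_id, clips):
    """Nearest neighbor by cosine similarity on the full vectors."""
    q = clips[query_id]["full"]
    others = [cid for cid in clips if cid != query_id]
    return max(others, key=lambda cid: cosine(q, clips[cid]["full"]))


def nearest_by_layout(query_id, clips):
    """Nearest neighbor by Euclidean distance in the 2D layout (inspection only)."""
    q = clips[query_id]["xy"]
    others = [cid for cid in clips if cid != query_id]
    return min(others, key=lambda cid: math.dist(q, clips[cid]["xy"]))


# Made-up data: clip "c" sits near "a" on the map but far from it in the full space.
clips = {
    "a": {"full": [1.0, 0.0, 0.0, 0.0], "xy": (0.0, 0.0)},
    "b": {"full": [0.9, 0.1, 0.0, 0.0], "xy": (2.0, 0.0)},
    "c": {"full": [0.0, 0.0, 1.0, 0.0], "xy": (0.5, 0.0)},
}

print(nearest_by_embedding("a", clips))  # b
print(nearest_by_layout("a", clips))     # c
```

Only the first answer should ever feed a decision; the second is what the eye sees on the scatter plot.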
Stage 6: Cluster
Clustering runs on the full embeddings, not the projection. HDBSCAN (hierarchical density-based spatial clustering of applications with noise) finds dense regions in the high-dimensional space and labels them. Density is the right primitive: it allows variable-density clusters and gracefully ignores outliers as noise rather than forcing them into a category they don't fit. On American crow embeddings, HDBSCAN converges on roughly nine to twelve clusters depending on minimum-samples parameters. The cluster identities (territorial, mobbing, rattle, begging) are then assigned post-hoc by listening to exemplar clips from each dense region. The names come from human biologists; the boundaries come from the model. Failure mode: choosing the cluster count in advance, K-means style, instead of letting density determine it.
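To show what "letting density determine it" means, here is a minimal pure-Python sketch of plain DBSCAN, the simpler ancestor of HDBSCAN. It is a stand-in, not HDBSCAN: it uses one fixed radius (`eps`) and so cannot handle variable-density clusters, but it shares the two properties the paragraph relies on, namely that the number of clusters is an output and that outliers are labeled as noise. The 2D points are made up for readability; in the pipeline the inputs would be full embeddings.

```python
import math

NOISE = -1


def dbscan(points, eps, min_samples):
    """Plain DBSCAN: one label per point, NOISE (-1) for outliers."""
    def neighbors(i):
        # Includes the point itself, so min_samples counts it too.
        return [j for j in range(len(points))
                if math.dist(points[i], points[j]) <= eps]

    labels = [None] * len(points)
    cluster = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_samples:
            labels[i] = NOISE          # may be claimed later as a border point
            continue
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster    # border point: reachable but not dense
            if labels[j] is not None:
                continue               # already assigned, do not expand again
            labels[j] = cluster
            j_neighbors = neighbors(j)
            if len(j_neighbors) >= min_samples:
                queue.extend(j_neighbors)  # core point: keep growing the cluster
        cluster += 1
    return labels


# Made-up points: two tight groups and one stray.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
          (5.0, 5.0), (5.1, 5.0), (5.0, 5.1),
          (20.0, 20.0)]
labels = dbscan(points, eps=0.5, min_samples=3)
print(labels)  # [0, 0, 0, 1, 1, 1, -1]
```

Nothing in the call says "find two clusters". The count falls out of the density parameters, and the stray point is left unlabeled rather than forced into a group.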
Stage 7: Contextualize
A cluster is a geometric region. A behavioral interpretation requires synchronized observation. When researchers record both audio and behavior (who's calling, what they're doing, whether predators are present, whether kin are nearby), they can train a secondary classifier on top of the embeddings: given this vector, what behavioral context did the recording metadata report? That's how clusters get probability distributions across categories like territorial, alarm, recruitment, affiliative. The Wright laboratory and Demartsev[4]'s group have done this rigorously for corvids. The probabilities on CrowLingo's atlas are downstream of that line of work: never per-clip ground truth, always cluster-wide estimates. Failure mode: presenting cluster-wide probability estimates as per-clip ground truth in the user-facing copy.
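What "cluster-wide estimate" means can be shown with simple counting. The sketch below is a deliberate simplification: instead of training a classifier on the vectors, it tallies the behavioral contexts reported for the annotated clips in each cluster and turns the counts into relative frequencies. The function name and the labeled clips are made up for illustration.

```python
from collections import Counter, defaultdict


def cluster_context_probabilities(annotated_clips):
    """Estimate P(context | cluster) from annotated clips by relative frequency."""
    counts = defaultdict(Counter)
    for cluster_id, context in annotated_clips:
        counts[cluster_id][context] += 1
    return {
        cluster_id: {ctx: n / sum(tally.values()) for ctx, n in tally.items()}
        for cluster_id, tally in counts.items()
    }


# Made-up annotations: (cluster id, context reported in the recording metadata).
annotated_clips = [
    (0, "territorial"), (0, "territorial"), (0, "alarm"), (0, "territorial"),
    (1, "alarm"), (1, "alarm"),
]
probs = cluster_context_probabilities(annotated_clips)
print(probs[0])  # {'territorial': 0.75, 'alarm': 0.25}
print(probs[1])  # {'alarm': 1.0}
```

The 0.75 belongs to cluster 0 as a whole. It does not say that any particular clip in that cluster is 75 percent territorial, and with only four annotated clips it is a weak estimate even at the cluster level.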
Stage 8: Respond
The last stage is the one with the most ethical weight: now that you have a vocal map and a cluster interpretation, what do you do with it? Possible answers range from inert and useful — improved species monitoring, conservation telemetry, public education like this site — to interventionist and risky: playing recorded crow calls back to wild birds, deploying generated audio in the field, claiming translation in popular press. Earth Species Project's NatureLM-audio[2] launch post explicitly named the risk of over-promising. CrowLingo's editorial floor is the same: catalog, characterize, narrate cautiously. Translation, if it comes, will not come from this generation of models. Failure mode: shipping a 'translate this crow' feature before the science can defend it.