Decoding · Sub-page
Contextual clustering — geometry meets behavior.
A cluster is just a number until you join it to what the crow was doing. The join is where geometry turns into meaning.
A cluster on the vocal map is a geometric region in the embedding space. A behavioral context is the activity that was occurring when the recording was made — mobbing, foraging, paired calling, parent-offspring exchange. When the two are joined, the model can be asked: for any new vocalization, what's the probability distribution across contexts? The 2026 Demartsev paper on carrion crows did this rigorously using wearable bioacoustic loggers that captured behavior and audio simultaneously. The probabilities on CrowLingo's atlas are downstream of that line of work — cluster-wide estimates, never per-clip ground truth, but useful enough to make the clusters interpretable.

The join
The pipeline (stages 5–7) produces two tables: an audio table with cluster IDs, and a behavior table with timestamped observations. Joined by time window — usually a few-hundred-millisecond tolerance around the call — they yield a third table: per cluster, the distribution of behaviors that co-occurred.
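A minimal sketch of that join, under stated assumptions: the table layouts, the function name `join_by_time_window`, the 0.3-second default tolerance, and the sample rows are all hypothetical and are not taken from the CrowLingo pipeline. Behavior observations are modeled here as point timestamps; a real behavior log may store intervals instead.

```python
from bisect import bisect_left
from collections import Counter, defaultdict


def join_by_time_window(calls, observations, tolerance_s=0.3):
    """Join calls to their nearest behavior observation within a tolerance.

    calls:        list of (time_s, cluster_id)   -- hypothetical audio table
    observations: list of (time_s, behavior)     -- hypothetical behavior table
    Returns {cluster_id: {behavior: fraction of matched calls}}.
    """
    observations = sorted(observations)
    obs_times = [t for t, _ in observations]
    counts = defaultdict(Counter)

    for call_time, cluster_id in calls:
        # Only the observations on either side of the call can be nearest.
        i = bisect_left(obs_times, call_time)
        candidates = [j for j in (i - 1, i) if 0 <= j < len(obs_times)]
        if not candidates:
            continue
        nearest = min(candidates, key=lambda j: abs(obs_times[j] - call_time))
        # Calls with no observation inside the window are left unmatched.
        if abs(obs_times[nearest] - call_time) <= tolerance_s:
            counts[cluster_id][observations[nearest][1]] += 1

    return {
        cluster: {b: n / sum(c.values()) for b, n in c.items()}
        for cluster, c in counts.items()
    }


# Made-up rows for illustration only.
calls = [(1.00, 3), (2.50, 3), (4.00, 3), (6.00, 7), (9.00, 7)]
observations = [
    (1.10, "mobbing"),
    (2.45, "mobbing"),
    (4.20, "foraging"),
    (6.90, "foraging"),
    (9.05, "foraging"),
]
dist = join_by_time_window(calls, observations)
```

In this toy data the call at 6.00 s has no observation within the window, so it drops out of the third table rather than being assigned a guessed context. Whether to drop or flag such calls is a design choice the source does not specify.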
The shape of that distribution is the signal. A cluster whose calls occur 80% during territorial defense and 5% during foraging is doing something different from a cluster that splits evenly across contexts. The first is interpretable; the second is either an encoding artifact or a genuinely context-generic call type, such as a greeting or contact call.
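One way to put a number on "shape" is normalized Shannon entropy: close to 0 when a cluster concentrates on one context, 1 when it splits evenly. This is an illustrative choice of measure, not one the source attributes to the pipeline. The function name and the example distributions are hypothetical; the 80% and 5% figures echo the example above, and the remaining 15% is lumped into a made-up "other" context.

```python
import math


def normalized_entropy(distribution):
    """Shannon entropy of a context distribution, scaled to [0, 1].

    distribution: {context: probability}. Include every possible context
    as a key, even at probability 0, so the scaling stays comparable
    across clusters.
    """
    n_contexts = len(distribution)
    if n_contexts < 2:
        return 0.0
    # Zero-probability contexts contribute nothing to the sum.
    h = -sum(p * math.log2(p) for p in distribution.values() if p > 0)
    return h / math.log2(n_contexts)


# Hypothetical clusters: one peaked on a single context, one evenly split.
peaked = {"territorial_defense": 0.80, "foraging": 0.05, "other": 0.15}
flat = {"territorial_defense": 1 / 3, "foraging": 1 / 3, "other": 1 / 3}

peaked_score = normalized_entropy(peaked)
flat_score = normalized_entropy(flat)
```

The peaked cluster scores well below the flat one. A high score alone cannot say whether a cluster is an artifact or a context-generic call type; it only flags clusters that need that follow-up question.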
What the wearable-logger work showed
The 2026 carrion-crow paper (Demartsev et al., bioRxiv) is the cleanest recent example. The team deployed wearable audio loggers on a cooperatively breeding crow population, capturing audio and accelerometry per individual. The behavior log was the accelerometer trace, time-aligned to the second.
When they clustered the vocal embeddings and joined them to the accelerometry-derived behavioral states, they recovered both discrete repertoire structure (clusters that map cleanly to a single behavior) and graded structure (grunts that vary continuously with motor activity). The latter was invisible before: graded variation that the old hand-labeling regime had to either force into discrete categories or discard.
What this is not
It is not "the crow says X means Y." The joined distribution captures co-occurrence, not semantics. We don't know whether the call causes the behavior, describes the behavior, or simply accompanies the behavioral state.
Distinguishing those would require intervention — playback experiments with calibrated control — which the ethics floor constrains heavily and which the Respond stage only barely starts to address.
The honest interpretation
Contextual clustering gives you a probabilistic map from acoustic form to behavioral context. That map is a strong foundation for designing playback experiments, for spotting outlier calls worth investigating, and for tracking repertoire change over time. It is not, by itself, a dictionary.