The fifty-year regime
From the 1970s through roughly 2020, the dominant method for mapping a bird's vocal repertoire was hand-engineered acoustic profiling. A researcher would record a population, isolate calls, measure features by hand or with semi-automated tools — mean fundamental frequency, peak amplitude, duration, harmonic-to-noise ratio — and cluster the results with classical statistics. The 2014 paper by Mates[1], Tarter, Ha, Clark, and McGowan on American crow caw profiling is the canonical example of this regime done well: it showed that a single caw, measured carefully, carries caller sex and individual identity. The result is robust and reproducible, and every modern method takes it as a baseline. But the features being measured were chosen by humans. The framework could only see what its measurement vocabulary admitted. Graded variation, dialect, individual signature — all were present in the audio; only some made it into the labels.
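To make the hand-engineered approach concrete, here is a minimal sketch of three of the measurements named above (duration, peak amplitude, fundamental frequency) computed on a synthetic tone. The function name `caw_features`, the pitch search range, and the autocorrelation pitch estimator are illustrative assumptions, not the procedure used in the 2014 paper; harmonic-to-noise ratio is left out for brevity.

```python
import math

def caw_features(samples, sample_rate, fmin=200.0, fmax=2000.0):
    """Measure three classic hand-picked features from one mono clip.

    fmin/fmax bound the pitch search and are assumed values, not
    species-specific constants.
    """
    n = len(samples)
    duration_s = n / sample_rate
    peak_amplitude = max(abs(s) for s in samples)

    # Estimate fundamental frequency by autocorrelation: find the lag (within
    # the allowed pitch range) at which the clip best matches a shifted copy.
    min_lag = int(sample_rate / fmax)
    max_lag = int(sample_rate / fmin)
    best_lag, best_score = min_lag, float("-inf")
    for lag in range(min_lag, max_lag + 1):
        score = sum(samples[i] * samples[i + lag] for i in range(n - lag))
        if score > best_score:
            best_lag, best_score = lag, score

    return {
        "duration_s": duration_s,
        "peak_amplitude": peak_amplitude,
        "f0_hz": sample_rate / best_lag,
    }

# Synthetic stand-in for a recording: a 0.1 s, 500 Hz tone sampled at 8 kHz.
sr = 8000
clip = [0.8 * math.sin(2 * math.pi * 500 * t / sr) for t in range(800)]
features = caw_features(clip, sr)
```

Every number this returns answers a question a human decided to ask in advance, which is exactly the limitation described above: anything outside the chosen feature list never reaches the clustering step.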
What changed
Self-supervised learning happened. By 2022, audio foundation models (trained without labels on the entire Xeno-canto corpus, the Macaulay Library, the FreeSound archive) were producing embeddings rich enough that downstream tasks like species classification could be solved zero-shot. The fundamental trick was simple in retrospect: take a spectrogram, mask part of it, and ask the model to predict the masked portion from what remains. Do this millions of times across diverse wildlife audio, and the model learns acoustic representations that human-engineered features can't match. The downstream effect: instead of a half-dozen hand-picked measurements per clip, every clip becomes a 1,024- or 1,536-dimensional vector. Similar clips sit close together in that high-dimensional space. Different clips sit far apart. Cluster boundaries emerge from the geometry without anyone specifying what to look for.
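The mask-and-predict objective described above can be sketched in a few lines. This is a toy under stated assumptions: the input is a small time-by-frequency grid of numbers, the "model" is a hypothetical stand-in that simply blends the nearest visible frames, and the loss is mean squared error on the hidden frames only. Real audio models replace the stand-in with a neural network trained by gradient descent, and their exact masking schemes and losses differ.

```python
def masked_prediction_loss(frames, predict, mask_start, mask_len):
    """Mean squared error on the masked frames only.

    frames:  list of equal-length lists (time steps x frequency bins)
    predict: callable that sees the masked input and returns a full guess
    """
    masked_idx = range(mask_start, mask_start + mask_len)
    # Hide the masked span from the model by replacing it with None.
    visible = [None if t in masked_idx else frame
               for t, frame in enumerate(frames)]
    guess = predict(visible)
    errors = [(g - x) ** 2
              for t in masked_idx
              for g, x in zip(guess[t], frames[t])]
    return sum(errors) / len(errors)

def interpolate_neighbors(visible):
    """Toy stand-in for a model: average the nearest visible frames.

    Assumes the masked span does not touch either end of the clip.
    """
    filled = list(visible)
    for t, frame in enumerate(visible):
        if frame is None:
            left = next(visible[i] for i in range(t - 1, -1, -1)
                        if visible[i] is not None)
            right = next(visible[i] for i in range(t + 1, len(visible))
                         if visible[i] is not None)
            filled[t] = [(a + b) / 2 for a, b in zip(left, right)]
    return filled

# Eight frames of three bins each, rising steadily over time.
frames = [[float(t), 2.0 * t, 3.0 * t] for t in range(8)]
loss = masked_prediction_loss(frames, interpolate_neighbors, mask_start=3, mask_len=3)
```

The loss is computed only where the input was hidden, so the model gets no credit for copying what it could already see. Driving that number down across millions of clips is what forces the internal representation to capture acoustic structure.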
The three models worth knowing
BirdNET[2], from Cornell Lab of Ornithology, is the workhorse. Released by Kahl et al. in 2021, it runs on phones, smart speakers, and long-term acoustic monitoring deployments. Its EfficientNet-B0 backbone produces a 1,024-dim embedding biased toward species-discriminative features. It's not the most sophisticated model anymore, but it's the one that scaled, and the design choices it made shaped every successor.
Perch, from Google Research, is the within-species detail king. Released in 2025 with an EfficientNet-B3 backbone and roughly 12M parameters, Perch produces a (5, 3, 1,536) embedding that captures graded variation BirdNET smooths over. It ships with open weights on Hugging Face and an ONNX port for edge inference. For mapping the geometry of one species' repertoire (the CrowLingo problem), Perch is the published state of the art.
NatureLM-audio[3], from Earth Species Project, is the audio-language model. Presented at ICLR 2025 by Robinson, Miron, Hagiwara, Pietquin and colleagues, it combines a BEATs audio encoder with a Llama-3.1-8B-Instruct language backbone. You can ask it natural-language questions about audio clips ("what species," "how many," "what life stage") and get useful answers zero-shot. It is SOTA on the BEANS-Zero benchmark.
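A (5, 3, 1,536) output like the one mentioned for Perch is a small grid of 1,536-channel cells, not yet the single vector per clip that clustering and search need. One common way to get there is to average over the grid. The sketch below assumes the first two axes form a coarse time-frequency grid and that mean pooling is an acceptable reduction; both are assumptions for illustration, and `mean_pool` is a hypothetical helper, not part of any model's published API.

```python
def mean_pool(spatial):
    """Average a (time, freq, channels) nested list over time and freq."""
    cells = [cell for row in spatial for cell in row]
    channels = len(cells[0])
    return [sum(cell[c] for cell in cells) / len(cells) for c in range(channels)]

# Placeholder values standing in for real model output of shape (5, 3, 1536).
spatial = [[[float(t + f) for _ in range(1536)] for f in range(3)]
           for t in range(5)]
vector = mean_pool(spatial)  # one 1,536-dimensional vector for the clip
```

Pooling trades away the information about where in the clip a feature occurred in exchange for a fixed-size vector that can be compared directly with every other clip's vector.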
From embedding to atlas
An embedding by itself is unviewable. 1,024 dimensions don't fit on a screen. The standard move is to project to two dimensions for inspection using UMAP (uniform manifold approximation and projection), which preserves local neighborhood structure while flattening the global geometry. Calls that were close in the full embedding stay close in the 2D scatter; distances between far-apart calls are less trustworthy. UMAP isn't the only option (t-SNE projects similarly, PaCMAP differently, and PCA loses too much), but it has become the default because it preserves the relationships that matter to a human inspector. The atlas you see at /the-crow/repertoire-atlas is the UMAP projection of a NatureLM-audio[3] embedding of our v1 CC-licensed corpus, with cluster boundaries assigned by clustering on the full vectors. The caveat: the 2D positions are for inspection only. Nearest-neighbor queries and similarity search always run on the full 1,024-dimensional vectors.
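The caveat about running queries on the full vectors is easy to demonstrate. The sketch below uses tiny 4-dimensional stand-ins for the real vectors and assumes cosine similarity as the distance measure; the clip ids are invented, and the "projection" that keeps only the first two coordinates is a deliberately crude stand-in for any lossy 2D view, not UMAP itself.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def nearest_neighbors(query, embeddings, k=3):
    """Return the ids of the k stored clips most similar to the query."""
    ranked = sorted(
        embeddings,
        key=lambda clip_id: cosine_similarity(query, embeddings[clip_id]),
        reverse=True,
    )
    return ranked[:k]

# Invented clips; 4 dimensions stand in for 1,024.
embeddings = {
    "caw_a": [1.0, 0.0, 0.0, 0.9],
    "caw_b": [1.0, 0.0, 0.0, -0.9],
    "rattle": [0.0, 1.0, 0.2, 0.0],
}
query = [0.9, 0.1, 0.0, 1.0]

full_ranking = nearest_neighbors(query, embeddings, k=3)

# Crude 2D view: keep only the first two coordinates of each vector.
flattened = {clip_id: vec[:2] for clip_id, vec in embeddings.items()}
```

In the full space the query clearly prefers `caw_a` over `caw_b`, because they differ in the fourth coordinate. In the flattened view the two land on exactly the same point, so any search run there could not tell them apart.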
What we can decode now
Caller sex, individual identity, behavioral context: all are recoverable from a single half-second of crow voice with embedding-based pipelines. Pitch contour is the most reliable sex tag, harmonic emphasis fingerprints individuals, duration and inter-call interval shift with context, and spectral grain tracks arousal. None of these dimensions is new. Researchers documented sex- and identity-encoded features in crow caws over a decade ago. What's new is that the AI extracts all of them simultaneously, automatically, on millions of calls, and locates each call as a single point in a shared map where similarity is geometry. The territorial caw cluster differs from the mobbing cluster not by one variable but by a thousand correlated micro-differences.
What we cannot decode (yet)
Lexical meaning. Compositional syntax. Anything that would deserve the word translation. Statistical models hint at structured composition in crow vocal sequences — caw-rattle combinations are non-random — but the behavioral evidence that crows treat sequence order as carrying meaning beyond the sum of its parts is thin. This is where the next five years of corvid bioacoustic research will live: not in better embeddings, but in synchronized behavioral observation rigorous enough to anchor what the embeddings see. The 2026 Demartsev[4] paper on carrion crows, using wearable bioacoustic loggers to capture 127,000+ vocalizations with synchronized behavior, is the cleanest published example of where the field is going.
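The claim that caw-rattle combinations are non-random is the kind of statement a permutation test can check: count how often one call type directly follows another, then see how often a random reordering of the same calls produces a count at least that high. The sketch below uses an invented label sequence and hypothetical function names, and it is not the analysis from any cited paper. It also tests ordering only; a small p-value says nothing about meaning.

```python
import random

def count_bigram(sequence, first, second):
    """Count positions where `first` is immediately followed by `second`."""
    return sum(1 for a, b in zip(sequence, sequence[1:])
               if (a, b) == (first, second))

def permutation_p_value(sequence, first, second, n_shuffles=2000, seed=0):
    """Share of shuffled sequences with at least as many first->second pairs."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    observed = count_bigram(sequence, first, second)
    shuffled = list(sequence)
    at_least = 0
    for _ in range(n_shuffles):
        rng.shuffle(shuffled)
        if count_bigram(shuffled, first, second) >= observed:
            at_least += 1
    # Add-one correction keeps the estimate away from an impossible p = 0.
    return observed, (at_least + 1) / (n_shuffles + 1)

# Invented call labels for illustration: every rattle directly follows a caw.
calls = ["caw", "rattle", "coo", "coo"] * 10
observed, p_value = permutation_p_value(calls, "caw", "rattle")
```

Shuffling keeps the number of each call type fixed and destroys only the order, so a low p-value points at sequence structure, not at one call type simply being common. Whether the birds treat that structure as meaningful is the behavioral question the statistics cannot answer.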
Why crows
Choosing a species for animal-language work isn't neutral. American crows clear several relevant bars: cognitive capacity to use communication for more than reflexive signaling, stable family groups that make individual recognition meaningful, demonstrated cultural transmission of information across generations, and lifespans long enough that individual histories matter. The Marzluff[5] mask experiments showed crows recognize and remember individual humans for years and pass face-recognition information socially across family groups. Heinrich's slow work on ravens (close relatives in the same genus) catalogued cross-generational learning. None of this is sufficient evidence for crow language. All of it is necessary scaffolding for any program that takes the question seriously. CrowLingo is, by design, only about crows. The methods generalize; the social baseline doesn't.
The honest summary
AI didn't translate crow language in 2026. AI made it possible to catalog the language at scale — to map every vocalization onto a single inspectable geometry, to surface graded variation that hand-labeling missed, to identify individuals and behavioral contexts in milliseconds rather than weeks. Translation, if it comes, will not come from this generation of models. It will come from the next decade of synchronized behavioral observation built on top of the mapping infrastructure we have now. The honest takeaway, from inside the field: we have a map. We do not yet have a dictionary. Be skeptical of any claim that we do.