What the AI methods actually do

Self-supervised audio foundation models, such as the lineage that descends from BEATs and AudioMAE, produce rich embeddings of audio clips. Embeddings cluster geometrically: similar-sounding clips end up close together in high-dimensional space. Cluster boundaries can be assigned labels by humans listening to exemplars. Behavioral-context classifiers can be trained on top of embeddings when synchronized observation data is available. These are real capabilities. None of them is translation. The output of the contemporary pipeline is: a vocal repertoire mapped to a geometry, with cluster-level statistical associations to behavioral contexts. That is much more than the field had five years ago. It is much less than translation requires.
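To make "cluster-level statistical associations" concrete, here is a minimal sketch of what the end of such a pipeline produces. Everything in it is an illustrative assumption: the 2-D toy embeddings, the two centroids, the context labels, and the function names are hypothetical, and real embeddings would have hundreds of dimensions and come from a trained model rather than being typed in by hand.

```python
from collections import Counter, defaultdict
from math import dist


def nearest_centroid(embedding, centroids):
    """Return the index of the centroid closest to the embedding (Euclidean)."""
    return min(range(len(centroids)), key=lambda i: dist(embedding, centroids[i]))


def context_association(embeddings, contexts, centroids):
    """For each cluster, return the share of its clips observed in each context."""
    counts = defaultdict(Counter)
    for emb, ctx in zip(embeddings, contexts):
        counts[nearest_centroid(emb, centroids)][ctx] += 1
    return {
        cluster: {ctx: n / sum(c.values()) for ctx, n in c.items()}
        for cluster, c in counts.items()
    }


# Hypothetical toy data: 2-D stand-ins for clip embeddings and observed contexts.
centroids = [(0.0, 0.0), (1.0, 1.0)]
embeddings = [(0.1, 0.0), (0.0, 0.2), (0.9, 1.1), (1.0, 0.8), (0.2, 0.1), (1.1, 1.0)]
contexts = ["foraging", "foraging", "mobbing", "mobbing", "mobbing", "mobbing"]

table = context_association(embeddings, contexts, centroids)
for cluster, shares in sorted(table.items()):
    print(cluster, shares)
```

The result is a table of proportions per cluster. Nothing in it says what any individual call means, which is the gap the rest of this piece is about.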

What translation linguistically requires

Translation in the sense the popular framing uses it (converting a signal in animal vocalization into an equivalent signal in human language while preserving meaning) requires four things the contemporary methods cannot deliver. First, evidence that the source signals encode meaning in a stable, compositional way that the receiver decodes. Second, a mapping between source-signal components and meaning components that holds across contexts and individuals. Third, evidence that the mapping is the animal's own: that the receiver uses it for the function the translation claims. Fourth, a target representation in human language that the source meaning can survive being mapped to. The third and fourth requirements are the hard ones. The first two are necessary but not sufficient.

The compositionality problem

Human language is famously compositional: 'red ball' means something the speaker and listener can both compute from 'red' plus 'ball' even if they've never heard the combination. Animal communication systems mostly aren't compositional in this sense. Most known animal alarm-call systems are referentially primitive: predator type maps to call type, with little internal structure. The corvid case is interesting because there are statistical hints of structured composition in caw-rattle sequences. Mates et al.'s individual-signature work [3] shows the substrate, and several open-question papers from the Wright lab suggest sequence order may carry information. But statistical hints are not behavioral evidence. The leap from 'sequence statistics are non-random' to 'sequence order carries meaning the bird decodes' requires playback experiments at a scale and care nobody has run for American crows. Until those run, the compositionality claim is undertested.
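For readers who want to see what a "statistical hint" of this kind looks like, the following is a minimal sketch of a permutation test on call order. It is not the method of any paper cited here; the call labels, the choice of a single ordered pair as the test statistic, and the function names are assumptions made for illustration.

```python
import random


def bigram_count(seq, first, second):
    """Count how often `first` is immediately followed by `second`."""
    return sum(1 for a, b in zip(seq, seq[1:]) if a == first and b == second)


def order_permutation_test(seq, first, second, n_perm=999, seed=0):
    """Compare the observed bigram count with counts from shuffled sequences.

    Shuffling keeps the call-type frequencies but destroys order, so a small
    p-value says only that the ordering is unlikely under random arrangement.
    """
    rng = random.Random(seed)
    observed = bigram_count(seq, first, second)
    shuffled = list(seq)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if bigram_count(shuffled, first, second) >= observed:
            hits += 1
    # Add-one correction so the estimated p-value is never exactly zero.
    return observed, (hits + 1) / (n_perm + 1)


# Hypothetical, perfectly alternating sequence; real recordings are far noisier.
calls = ["caw", "rattle"] * 20
observed, p_value = order_permutation_test(calls, "caw", "rattle")
print(observed, p_value)
```

A low p-value here establishes non-random sequence statistics and nothing more. Whether a receiving bird treats the order as informative is a separate question that this kind of test cannot touch.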

The semantic-anchor problem

Even if you established compositional structure, you'd need a way to map source components to meaning components. In human-language machine translation, the anchor is parallel corpora: documents that exist in both languages, professionally translated, providing millions of mapping pairs the model can learn from. There is no parallel corpus for crow-to-English. There is no professionally translated set of crow vocalizations with verified meaning labels. The closest analog is synchronized behavioral observation, and even that gives you cluster-level statistical associations, not per-utterance translations. Project CETI's work on sperm whale codas faces the same problem at a different scale; Andreas et al.'s 2022 roadmap paper [1] is unusually honest about the conceptual difficulty.

The receiver problem

Suppose, generously, that a vocal map identifies that a particular acoustic feature reliably co-occurs with a particular behavioral context. That establishes a sender-side association. It does not establish that the receiver of the call decodes that feature into that meaning. Animals can produce signals consistently without their conspecifics treating those signals as carrying the information the production statistics suggest. Receiver-side decoding requires either physiological measurement (does the receiver's brain respond differently to feature-present vs feature-absent calls in matched contexts) or behavioral playback experiments. Both are hard to do ethically at the scale needed. CrowLingo's atlas reflects sender-side production; receiver-side decoding remains, scientifically, mostly open.

What would actually constitute translation evidence

The minimum standard: a sender-side mapping (call feature X reliably accompanies context Y), a receiver-side mapping (animals exposed to X without context Y respond as if Y is present), reproducibility across populations and individuals, and a target-language representation that preserves the relevant meaning components. Even at this minimum, no animal communication system has been credibly 'translated' in the strong sense. Honeybee waggle dance is the closest case — sender-side mapping is rigorous, receiver-side mapping is established by classic experiments — but waggle dance is referentially shallow and corresponds to spatial coordinates, not what would normally be called language.
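The receiver-side criterion is the one most easily stated as a test. As a rough sketch, assuming a playback design in which one group hears call feature X without context Y and a control group hears a matched call lacking X, a one-sided Fisher exact test asks whether the X group responds more often than chance would explain. The counts below are invented for illustration, and the function name is hypothetical.

```python
from math import comb


def fisher_one_sided(resp_test, n_test, resp_ctrl, n_ctrl):
    """One-sided Fisher exact p-value for 'test group responds more often'.

    With the total number of responders held fixed, this is the probability
    of seeing at least `resp_test` responders in the test group by chance.
    """
    total_resp = resp_test + resp_ctrl
    total = n_test + n_ctrl
    upper = min(n_test, total_resp)
    favourable = sum(
        comb(total_resp, k) * comb(total - total_resp, n_test - k)
        for k in range(resp_test, upper + 1)
    )
    return favourable / comb(total, n_test)


# Hypothetical counts: 9 of 10 birds respond to X, 2 of 10 respond to the control.
p_playback = fisher_one_sided(9, 10, 2, 10)
# No difference between groups: 5 of 10 respond in each.
p_null = fisher_one_sided(5, 10, 5, 10)
print(p_playback, p_null)
```

Even a convincing result of this kind addresses only one of the four criteria above. It says nothing about reproducibility across populations or about whether a human-language gloss preserves what the receiver is responding to.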

Why the framing matters

Calling what AI does in 2026 'translation' is not merely imprecise. It actively misleads about what the science can deliver, sets up the field for credibility-damaging walkbacks, and steers funding decisions away from the slower behavioral work that would actually constitute progress on the receiver-side problem. The honest framing is: AI has made vocal-repertoire mapping much better. Translation requires receiver-side evidence that mapping work alone cannot produce. Both can be true; both ARE true. The next decade of meaningful work in this space is on the receiver side, using the mapping infrastructure we now have as a tool, not a destination.

The cleanest public version of this position

The Earth Species Project's launch framing in 2024 [2] was the most honest public statement of these constraints from a leading peer lab. Andreas et al.'s 2022 paper on the Cetacean Translation Initiative roadmap [1] is the most honest peer-reviewed version. Both name the failure modes: anthropomorphism, over-reach, and ecological harm. CrowLingo's editorial floor descends from those framings. We map, we characterize, we narrate cautiously. We don't claim translation, and when popular coverage of our work attributes translation claims to us, we correct it.