NatureLM-audio.

The first audio-language foundation model designed for bioacoustics. Audio in, natural language out. The workflow shift you can feel.

NatureLM-audio is the Earth Species Project's audio-language foundation model, presented at ICLR 2025. It combines a BEATs audio encoder with a Llama-3.1 language backbone, trained on a curated bioacoustic corpus drawn from Xeno-canto, the Macaulay Library, and other open archives. Crucially, it answers natural-language questions about audio clips zero-shot: 'What species is this?', 'How many individuals are calling?', 'What life stage?'. It reports state-of-the-art results on BEANS-Zero, the benchmark Masato Hagiwara's group built to evaluate exactly that capability. The atlas's NatureLM-audio attribution is not branding: it's the model whose embeddings would back a production version of our similarity search.

What it is

NatureLM-audio is a multimodal foundation model from Earth Species Project, published at ICLR 2025 (Robinson, Miron, Hagiwara, Pietquin et al., arXiv:2411.07186). The architecture pairs an audio encoder with a language model: audio frames flow into a shared representation, language flows out as captions, classifications, or descriptions.

Weights are publicly available on Hugging Face at EarthSpeciesProject/NatureLM-audio. The model card and the BEANS-Zero benchmark it was evaluated against are linked from the repo.

What it does well

Zero-shot species detection. Ask "is there a crow in this clip?" without training a classifier.
Zero-shot behavioral classification. Ask "is this a territorial call?" using natural-language prompts.
Captioning. Generate a one-sentence description of what's on the audio.
Embedding extraction. Pull the encoder representations for downstream similarity, clustering, retrieval.
Cross-species transfer. Methods that work on crow audio frequently transfer to ravens, jays, magpies — sometimes with no retraining at all.
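
To make the embedding-extraction item concrete, here is a minimal sketch of the similarity step that would sit downstream of the encoder. It assumes embeddings have already been extracted and stored as plain lists of floats; the function names (cosine_similarity, top_k_similar), the clip IDs, and the toy 3-dimensional vectors are all hypothetical and stand in for real encoder output. It does not call NatureLM-audio itself.

```python
import math
from typing import Dict, List, Tuple


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two embedding vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimensionality")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat an all-zero vector as similar to nothing
    return dot / (norm_a * norm_b)


def top_k_similar(
    query: List[float], index: Dict[str, List[float]], k: int = 3
) -> List[Tuple[str, float]]:
    """Return the k clips whose embeddings are closest to the query."""
    scored = [(clip_id, cosine_similarity(query, emb)) for clip_id, emb in index.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy vectors standing in for encoder embeddings (made up for illustration).
index = {
    "clip_a": [1.0, 0.0, 0.0],
    "clip_b": [0.9, 0.1, 0.0],
    "clip_c": [0.0, 1.0, 0.0],
}
query = [1.0, 0.05, 0.0]
results = top_k_similar(query, index, k=2)
for clip_id, score in results:
    print(f"{clip_id}: {score:.3f}")
```

A brute-force scan like this is only reasonable for small collections; a production similarity search would more likely use an approximate nearest-neighbor index, which is outside the scope of this sketch.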

What it doesn't do

NatureLM-audio is a captioning and classification model. It does not generate audio. It does not perform real-time synthesis. It does not produce a "crow dictionary" mapping calls to human-language glosses with semantic precision: what it captions reflects training priors, not crow intent. Treat captions as starting points for investigation, not as translations.

How it changed the workflow

Before NatureLM-audio, building a useful crow-vocalization pipeline meant training a custom classifier for each question — a multi-week process per question. After: prompt-engineering. You ask the model questions in English; it answers; you validate against held-out ground truth. A single afternoon of iteration replaces weeks of model building.
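
The "ask, then validate against held-out ground truth" loop can be sketched roughly as follows. This is an illustration under stated assumptions, not the project's actual tooling: ask_model is a hypothetical callable that takes a clip path and a prompt and returns the model's text answer, and fake_model with its canned answers is a stub so the example runs without the real model. The clip filenames and labels are invented.

```python
from typing import Callable, List, Tuple


def normalize(answer: str) -> str:
    """Lowercase and strip whitespace and trailing periods so 'Yes.' matches 'yes'."""
    return answer.strip().lower().rstrip(".")


def evaluate_prompt(
    ask_model: Callable[[str, str], str],
    prompt: str,
    labeled_clips: List[Tuple[str, str]],
) -> float:
    """Fraction of held-out clips where the model's answer matches the label."""
    if not labeled_clips:
        raise ValueError("need at least one labeled clip")
    correct = 0
    for clip_path, expected in labeled_clips:
        predicted = normalize(ask_model(clip_path, prompt))
        if predicted == normalize(expected):
            correct += 1
    return correct / len(labeled_clips)


# Hypothetical stub standing in for a real model call.
CANNED_ANSWERS = {"clip_001.wav": "Yes.", "clip_002.wav": "no", "clip_003.wav": "Yes"}


def fake_model(clip_path: str, prompt: str) -> str:
    return CANNED_ANSWERS[clip_path]


# Invented held-out labels: the stub gets the third clip wrong.
held_out = [("clip_001.wav", "yes"), ("clip_002.wav", "no"), ("clip_003.wav", "no")]
accuracy = evaluate_prompt(fake_model, "Is there a crow in this clip?", held_out)
print(f"accuracy: {accuracy:.2f}")
```

Iterating on a prompt then means changing the prompt string, rerunning the evaluation on the same held-out set, and keeping whichever wording scores best. Exact string matching is the simplest possible scoring rule; free-form answers would need a more forgiving comparison.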

That shift is the workflow story of bioacoustics in 2025–26. Custom classifiers still happen — for high-stakes, high-volume tasks where every percentage point of accuracy matters. But the median question is now a prompt.

We are not trying to translate animal communication. We are trying to build the most useful tool we can, and use it carefully, and publish what we find.

— Robinson et al. (2025) · NatureLM-audio: An Audio-Language Foundation Model for Bioacoustics

What people ask about this.