CrowLingo


NatureLM-audio.

The first audio-language foundation model designed for bioacoustics. Audio in, natural language out. The workflow shift you can feel.


NatureLM-audio is the Earth Species Project's audio-language foundation model, presented at ICLR 2025. It combines a BEATs audio encoder with a Llama-3.1 language backbone, trained on a curated bioacoustic corpus drawn from Xeno-canto, the Macaulay Library, and other open archives. Crucially, it answers natural-language questions about audio clips zero-shot: 'What species is this?', 'How many individuals are calling?', 'What life stage?'. It is state of the art on BEANS-Zero, the benchmark Masato Hagiwara's group built to evaluate exactly that capability. The atlas's NatureLM-audio attribution is not branding: it is the model whose embeddings would back a production version of our similarity search.

AI interpretation, not translation.

What it is

NatureLM-audio is a multimodal foundation model from Earth Species Project, published at ICLR 2025 (Robinson, Miron, Hagiwara, Pietquin et al., arXiv:2411.07186). The architecture pairs an audio encoder with a language model: audio frames flow into a shared representation, and language flows out as captions, classifications, or descriptions.

Weights are publicly available on Hugging Face at EarthSpeciesProject/NatureLM-audio. The model card and the BEANS-Zero benchmark it was evaluated against are linked from the repo.

What it does well

  • Zero-shot species detection. Ask "is there a crow in this clip?" without training a classifier.
  • Zero-shot behavioral classification. Ask "is this a territorial call?" using natural-language prompts.
  • Captioning. Generate a one-sentence description of what's on the audio.
  • Embedding extraction. Pull the encoder representations for downstream similarity, clustering, retrieval.
  • Cross-species transfer. Methods that work on crow audio frequently transfer to ravens, jays, magpies — sometimes with no retraining at all.
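
To make the embedding-extraction item concrete, here is a minimal sketch of similarity search over clip embeddings. It assumes you already have one fixed-length vector per clip; the random vectors below are stand-ins, not NatureLM-audio output, and `top_k_similar` is a hypothetical helper name rather than part of any released API.

```python
import numpy as np


def top_k_similar(query, index, k=3):
    """Return (row indices, cosine scores) of the k rows of `index` closest to `query`."""
    # L2-normalise both sides so a plain dot product equals cosine similarity.
    q = query / np.linalg.norm(query)
    m = index / np.linalg.norm(index, axis=1, keepdims=True)
    scores = m @ q
    # Sort by descending score and keep the first k rows.
    order = np.argsort(-scores)[:k]
    return order, scores[order]


# Stand-in data (assumption): each row would be an encoder embedding of one clip.
rng = np.random.default_rng(0)
clip_embeddings = rng.normal(size=(100, 16))

# A query that is a slightly perturbed copy of clip 7, so clip 7 should rank first.
query_embedding = clip_embeddings[7] + 0.01 * rng.normal(size=16)

indices, scores = top_k_similar(query_embedding, clip_embeddings, k=3)
print(indices, scores)
```

Brute-force cosine search like this is usually fine for a few thousand clips; a larger archive would likely call for an approximate nearest-neighbour index instead.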

What it doesn't do

NatureLM-audio is a captioning and classification model. It does not generate audio. It does not perform real-time synthesis. It does not produce a "crow dictionary" mapping calls to human-language glosses with semantic precision: what it captions reflects training priors, not crow intent. Treat captions as starting points for investigation, not as translations.

How it changed the workflow

Before NatureLM-audio, building a useful crow-vocalization pipeline meant training a custom classifier for each question, a multi-week process every time. Now the work is prompt engineering: you ask the model questions in English, it answers, and you validate the answers against held-out ground truth. A single afternoon of iteration replaces weeks of model building.
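
The validation step can be sketched as a small scoring loop. This is an illustration under stated assumptions, not the project's evaluation code: `ask` is a hypothetical wrapper around whatever inference call you use, and `fake_ask` is a canned stub so the example runs without the model or any audio files.

```python
from collections import Counter


def normalise(answer):
    """Lower-case an answer and drop surrounding whitespace and a trailing period."""
    return answer.strip().lower().rstrip(".")


def evaluate_prompt(ask, prompt, held_out):
    """Score one prompt against labelled clips.

    `ask(clip, prompt)` is a hypothetical callable returning the model's text answer.
    Returns (accuracy, Counter of (truth, prediction) pairs that disagreed).
    """
    hits = 0
    confusions = Counter()
    for clip, truth in held_out:
        predicted = normalise(ask(clip, prompt))
        if predicted == normalise(truth):
            hits += 1
        else:
            confusions[(normalise(truth), predicted)] += 1
    return hits / len(held_out), confusions


def fake_ask(clip, prompt):
    """Stub standing in for a real model call (assumption: canned answers only)."""
    canned = {
        "clip_a.wav": "Yes.",
        "clip_b.wav": "no",
        "clip_c.wav": "Yes",
        "clip_d.wav": "yes",
    }
    return canned[clip]


# Held-out ground truth: (clip name, expected answer). File names are illustrative.
held_out = [
    ("clip_a.wav", "yes"),
    ("clip_b.wav", "no"),
    ("clip_c.wav", "no"),
    ("clip_d.wav", "yes"),
]

accuracy, confusions = evaluate_prompt(fake_ask, "Is there a crow in this clip?", held_out)
print(accuracy, dict(confusions))
```

Keeping the disagreements alongside the accuracy makes it easier to see whether a reworded prompt fixes a specific failure or just shifts it elsewhere.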

That shift is the workflow story of bioacoustics in 2025–26. Custom classifiers are still built for high-stakes, high-volume tasks where every percentage point of accuracy matters. But the median question is now a prompt.

We are not trying to translate animal communication. We are trying to build the most useful tool we can, and use it carefully, and publish what we find.
Robinson et al. (2025) · NatureLM-audio: An Audio-Language Foundation Model for Bioacoustics

Frequently asked

What people ask about this.

What is NatureLM-audio?
NatureLM-audio is Earth Species Project's audio-language foundation model for bioacoustics, presented at ICLR 2025. It combines a BEATs audio encoder with a Llama-3.1-8B-Instruct language backbone and answers natural-language questions about audio clips zero-shot. It is state of the art on the BEANS-Zero benchmark.
What can NatureLM-audio actually do?
Zero-shot species classification, focal-bird identification, count estimation, life-stage classification, and natural-language captioning of arbitrary bioacoustic clips — without requiring task-specific fine-tuning. It's the first model where you can ask 'how many crows are calling' and get a useful answer.