BirdNET vs Perch 2.0 vs NatureLM-audio.

Three audio foundation models dominate bioacoustics in 2026. BirdNET is the workhorse detector that runs on phones. Perch 2.0 is the SOTA within-species embedding model from Google Research. NatureLM-audio is the Earth Species Project's audio-language model that answers natural-language questions about clips. Pick by the task, not the brand.

NatureLM-audio is the Earth Species Project's audio-language foundation model, presented at ICLR 2025. It combines a BEATs audio encoder with a Llama-3.1 language backbone and is trained on a curated bioacoustic corpus drawn from Xeno-canto, the Macaulay Library, and other open archives. Crucially, it answers natural-language questions about audio clips zero-shot: 'What species is this?', 'How many individuals are calling?', 'What life stage?'. It reports state-of-the-art results on BEANS-Zero, the benchmark Masato Hagiwara's group built to evaluate exactly that capability. The atlas's NatureLM-audio attribution is not branding: it's the model whose embeddings would back a production version of our similarity search.

The comparison

Architecture, training, output, license, when to use which.

Dimension	BirdNET	Perch 2.0	NatureLM-audio
Year released	2021 (v1) · 2024 (v2.4)	2025 (v2.0)	ICLR 2025
Authors / org	Stefan Kahl et al. · Cornell Lab of Ornithology	Tom Denton, Hamer, Wang et al. · Google Research	David Robinson, Marius Miron, Masato Hagiwara et al. · Earth Species Project
Architecture	EfficientNet-B0 backbone, mel-spectrogram input	EfficientNet-B3, ~12M params, multi-task heads	BEATs audio encoder + Llama-3.1-8B-Instruct language model
Training corpus	~3,000 species, ~80,000 hours Xeno-canto + Macaulay	Xeno-canto + Macaulay (curated bioacoustic)	Curated bioacoustic mix + audio-language pairs
Output type	Species classification + 1,024-dim embedding	Classification + (5, 3, 1,536) embedding	Natural-language answers (zero-shot Q&A)
Embedding dim	1,024	1,536 (mean-pooled)	Token-level, not a fixed pooled vector
Strength on corvids	Workhorse · fast · biased toward species-discriminative features	Best within-species detail · current SOTA for graded-call work	Best for natural-language queries · zero-shot captioning
Inference cost	CPU-friendly · runs on phones	GPU-friendly · ONNX port for edge	Heavy · needs an A10G or A100 in practice
Open license	MIT for the source code · model weights under CC BY-NC-SA 4.0	Open weights (Hugging Face)	Open weights (Hugging Face)
Best for	Real-time detection · long-term monitoring · phone apps	Similarity search · within-species clustering · CrowLingo atlas	Plain-English Q&A about clips · BEANS-Zero benchmark
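
The Perch 2.0 column lists both a (5, 3, 1,536) embedding and a 1,536-dim mean-pooled vector. A minimal sketch of how one becomes the other, assuming the first two axes are time frames and frequency bands (an interpretation, not something the table states) and using a toy nested list in place of real model output. The `mean_pool` helper is hypothetical and not part of any Perch API; a real pipeline would more likely use an array library.

```python
from statistics import fmean


def mean_pool(spatial):
    """Average a (frames, bands, channels) nested list over its first two axes.

    Returns one flat vector with `channels` entries.
    """
    # Flatten the frames x bands grid into a single list of channel vectors.
    cells = [cell for frame in spatial for cell in frame]
    if not cells:
        raise ValueError("empty embedding")
    channels = len(cells[0])
    # Average each channel across all grid positions.
    return [fmean(cell[c] for cell in cells) for c in range(channels)]


# Toy stand-in with the shape from the table: 5 x 3 x 1536.
# Every channel of cell (t, f) holds the value t * 3 + f, purely for illustration.
spatial = [[[float(t * 3 + f)] * 1536 for f in range(3)] for t in range(5)]

pooled = mean_pool(spatial)
print(len(pooled))  # one fixed-length vector per clip
```

The pooled vector is what a similarity index would store per clip; the unpooled grid keeps coarse time and frequency structure if a downstream task needs it.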

When to use which

Three jobs, three models.

Detection at scale → BirdNET: When you have hours of audio and want to find every crow call, BirdNET's combination of speed, CPU support, and the massive Xeno-canto training corpus makes it the default. That is why it ships in a phone app and why it is such a common choice for long-term acoustic monitoring with passive recorders.
Within-species detail → Perch 2.0: Once you have crow calls isolated, Perch 2.0's embedding geometry captures graded variation BirdNET smooths over. This is what a production version of CrowLingo's similarity search (“find calls like this one”) would run on. Open weights, ONNX port for edge.
Plain-English Q&A → NatureLM-audio: When the question is “what species is this” or “how many individuals are calling” or “describe this clip,” you don't want an embedding, you want an answer. NatureLM-audio gives you one zero-shot. Heavy compute, niche task, but irreplaceable when you need it.
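
The “find calls like this one” search mentioned under Perch 2.0 can be sketched as a nearest-neighbour lookup over pooled embeddings. This is a minimal illustration, not CrowLingo's implementation: the clip IDs, the 3-dim toy vectors, and the `most_similar` helper are all hypothetical, and cosine similarity is an assumed choice of metric.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar(query, library, k=3):
    """Return the k (clip_id, score) pairs most similar to `query`, best first."""
    scored = [(clip_id, cosine_similarity(query, vec)) for clip_id, vec in library.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy library: real entries would be 1,536-dim pooled embeddings, one per isolated call.
library = {
    "clip_a": [1.0, 0.0, 0.0],
    "clip_b": [0.9, 0.1, 0.0],
    "clip_c": [0.0, 1.0, 0.0],
}

results = most_similar([1.0, 0.0, 0.0], library, k=2)
for clip_id, score in results:
    print(clip_id, round(score, 3))
```

A brute-force scan like this is fine for a few thousand clips; a larger archive would usually swap in an approximate nearest-neighbour index while keeping the same query shape.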
