BirdNET vs Perch 2.0 vs NatureLM-audio.
Three audio foundation models dominate bioacoustics in 2026. BirdNET is the workhorse detector that runs on phones. Perch 2.0 is the SOTA within-species embedding model from Google Research. NatureLM-audio is the Earth Species Project's audio-language model that answers natural-language questions about clips. Pick by the task, not the brand.
NatureLM-audio is the Earth Species Project's audio-language foundation model, presented at ICLR 2025. It combines a BEATs audio encoder with a Llama-3.1 language backbone, trained on a curated bioacoustic corpus drawn from Xeno-canto, the Macaulay Library, and other open archives. Crucially, it answers natural-language questions about audio clips zero-shot: "What species is this?", "How many individuals are calling?", "What life stage?". It is state of the art on BEANS-Zero, the benchmark Masato Hagiwara's group built to evaluate exactly that capability. The atlas's NatureLM-audio attribution is not branding: it's the model whose embeddings would back a production version of our similarity search.
The comparison
Architecture, training, output, license, when to use which.
| Dimension | BirdNET | Perch 2.0 | NatureLM-audio |
|---|---|---|---|
| Year released | 2021 (v1) · 2024 (v2.4) | 2025 (v2.0) | ICLR 2025 |
| Authors / org | Stefan Kahl et al. · Cornell Lab of Ornithology | Tom Denton, Hamer, Wang et al. · Google Research | David Robinson, Marius Miron, Masato Hagiwara et al. · Earth Species Project |
| Architecture | EfficientNet-B0 backbone, mel-spectrogram input | EfficientNet-B3, ~12M params, multi-task heads | BEATs audio encoder + Llama-3.1-8B-Instruct language model |
| Training corpus | ~3,000 species, ~80,000 hours Xeno-canto + Macaulay | Xeno-canto + Macaulay (curated bioacoustic) | Curated bioacoustic mix + audio-language pairs |
| Output type | Species classification + 1,024-dim embedding | Classification + (5, 3, 1,536) embedding | Natural-language answers (zero-shot Q&A) |
| Embedding dim | 1,024 | 1,536 (mean-pooled) | Token-level, not a fixed pooled vector |
| Strength on corvids | Workhorse · fast · biased toward species-discriminative features | Best within-species detail · current SOTA for graded-call work | Best for natural-language queries · zero-shot captioning |
| Inference cost | CPU-friendly · runs on phones | GPU-friendly · ONNX port for edge | Heavy · needs an A10G or A100 in practice |
| Open license | MIT for the source code (open source) · model weights under CC BY-NC-SA 4.0 | Open weights (Hugging Face) | Open weights (Hugging Face) |
| Best for | Real-time detection · long-term monitoring · phone apps | Similarity search · within-species clustering · CrowLingo atlas | Plain-English Q&A about clips · BEANS-Zero benchmark |
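
The Perch 2.0 column lists both a (5, 3, 1,536) embedding and a 1,536-dimensional mean-pooled one. A minimal sketch of how the second could follow from the first, assuming the pooling is a plain average over the two leading axes. Reading those axes as time frames and frequency bands is an assumption, not something the table states, and the `mean_pool` helper is illustrative rather than part of any Perch API. The toy tensor uses 4 channels instead of 1,536 so the result is easy to check by hand.

```python
from statistics import fmean


def mean_pool(spatial):
    """Average a nested list of shape (frames, bands, channels) over its first two axes."""
    # Flatten frames x bands into one list of channel vectors.
    cells = [cell for frame in spatial for cell in frame]
    channels = len(cells[0])
    # One mean per channel, taken across every (frame, band) cell.
    return [fmean(cell[c] for cell in cells) for c in range(channels)]


# Toy stand-in: leading shape (5, 3) as in the table, but 4 channels, not 1,536.
frames, bands, channels = 5, 3, 4
spatial = [
    [[float(c) for c in range(channels)] for _ in range(bands)]
    for _ in range(frames)
]

pooled = mean_pool(spatial)
print(len(pooled))  # one value per channel
```

The pooled vector has one entry per channel, which is what makes it usable as a fixed-length key for similarity search; the unpooled form keeps the 5 × 3 layout for tasks that need it.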
When to use which
Three jobs, three models.
- Detection at scale → BirdNET
  - When you have hours of audio and want to find every crow call, BirdNET's combination of speed, CPU support, and the massive Xeno-canto training corpus makes it the default. It's why BirdNET works as a phone app and why so many long-term passive acoustic monitoring deployments run it too.
- Within-species detail → Perch 2.0
  - Once you have crow calls isolated, Perch 2.0's embedding geometry captures graded variation BirdNET smooths over. This is what a production version of CrowLingo's similarity search (“find calls like this one”) would run on. Open weights, ONNX port for edge.
- Plain-English Q&A → NatureLM-audio
  - When the question is “what species is this,” “how many individuals are calling,” or “describe this clip,” you don't want an embedding, you want an answer. NatureLM-audio gives you one zero-shot. Heavy compute, niche task, but irreplaceable when you need it.
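
The “find calls like this one” search mentioned above can be sketched as a nearest-neighbour lookup over pooled embeddings. This is a minimal illustration, assuming cosine similarity as the distance measure and a small in-memory index. The clip IDs, the 3-dimensional vectors, and the `most_similar` helper are all hypothetical, and the page does not say which metric or index a production CrowLingo search would use.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar(query, index, k=3):
    """Return the k (clip_id, score) pairs with the highest cosine similarity to query."""
    scored = [(clip_id, cosine(query, vec)) for clip_id, vec in index.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Hypothetical index: clip ID -> pooled embedding (3 dims here for readability).
index = {
    "clip_a": [1.0, 0.0, 0.0],
    "clip_b": [0.9, 0.1, 0.0],
    "clip_c": [0.0, 1.0, 0.0],
}

results = most_similar([1.0, 0.0, 0.0], index, k=2)
for clip_id, score in results:
    print(clip_id, round(score, 3))
```

A brute-force scan like this is fine for a few thousand clips; a larger archive would more likely use an approximate nearest-neighbour index, which is a design choice outside what this page covers.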