CrowLingo

Methods · Compare

BirdNET vs Perch 2.0 vs NatureLM-audio.

Three audio foundation models dominate bioacoustics in 2026. BirdNET is the workhorse detector that runs on phones. Perch 2.0 is the SOTA within-species embedding model from Google Research. NatureLM-audio is the Earth Species Project's audio-language model that answers natural-language questions about clips. Pick by the task, not the brand.

AI narration · Methods · NatureLM-audio

NatureLM-audio is the Earth Species Project's audio-language foundation model, presented at ICLR 2025. It combines a BEATs audio encoder with a Llama-3.1 language backbone, trained on a curated bioacoustic corpus drawn from Xeno-canto, the Macaulay Library, and other open archives. Crucially, it answers natural-language questions about audio clips zero-shot: 'What species is this?', 'How many individuals are calling?', 'What life stage?'. It is SOTA on BEANS-Zero, the benchmark Masato Hagiwara's group built to evaluate exactly that capability. The atlas's NatureLM-audio attribution is not branding: it's the model whose embeddings would back a production version of our similarity search.

AI interpretation, not translation.

The comparison

Architecture, training, output, license, when to use which.

| Dimension | BirdNET | Perch 2.0 | NatureLM-audio |
| --- | --- | --- | --- |
| Year released | 2021 (v1) · 2024 (v2.4) | 2025 (v2.0) | ICLR 2025 |
| Architect / org | Stefan Kahl et al. · Cornell Lab of Ornithology | Tom Denton, Hamer, Wang et al. · Google Research | David Robinson, Marius Miron, Masato Hagiwara et al. · Earth Species Project |
| Architecture | EfficientNet-B0 backbone, mel-spectrogram input | EfficientNet-B3, ~12M params, multi-task heads | BEATs audio encoder + Llama-3.1-8B-Instruct language model |
| Training corpus | ~3,000 species, ~80,000 hours Xeno-canto + Macaulay | Xeno-canto + Macaulay (curated bioacoustic) | Curated bioacoustic mix + audio-language pairs |
| Output type | Species classification + 1,024-dim embedding | Classification + (5, 3, 1,536) embedding | Natural-language answers (zero-shot Q&A) |
| Embedding dim | 1,024 | 1,536 (mean-pooled) | Token-level, not a fixed pooled vector |
| Strength on corvids | Workhorse · fast · biased toward species-discriminative features | Best within-species detail · current SOTA for graded-call work | Best for natural-language queries · zero-shot captioning |
| Inference cost | CPU-friendly · runs on phones | GPU-friendly · ONNX port for edge | Heavy · needs an A10G or A100 in practice |
| Open license | MIT (open source) | Open weights (Hugging Face) | Open weights (Hugging Face) |
| Best for | Real-time detection · long-term monitoring · phone apps | Similarity search · within-species clustering · CrowLingo atlas | Plain-English Q&A about clips · BEANS-Zero benchmark |
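The Perch 2.0 column lists both a (5, 3, 1,536) embedding and a 1,536-dim mean-pooled vector. The sketch below shows one way the first could reduce to the second. It assumes the first two axes are spatial positions (for example time frames and frequency bands) and the last axis is the channel dimension. That axis reading, the `mean_pool` helper, and the random input are illustrative assumptions, not Perch's actual API or output.

```python
import random


def mean_pool(spatial):
    """Average a (frames, bands, channels) nested list over its first two axes."""
    # Flatten the frames x bands grid into one list of channel vectors.
    cells = [cell for frame in spatial for cell in frame]
    channels = len(cells[0])
    # Average each channel across every spatial position.
    return [sum(cell[c] for cell in cells) / len(cells) for c in range(channels)]


# Synthetic stand-in for one clip's (5, 3, 1536) embedding; not real model output.
rng = random.Random(0)
spatial = [
    [[rng.gauss(0.0, 1.0) for _ in range(1536)] for _ in range(3)]
    for _ in range(5)
]

pooled = mean_pool(spatial)
print(len(pooled))  # 1536
```

With NumPy, the same reduction on an array of that shape is `spatial.mean(axis=(0, 1))`. The pooled vector is the fixed-size representation that similarity search and clustering operate on.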

When to use which

Three jobs, three models.

Detection at scale → BirdNET
When you have hours of audio and want to find every crow call, BirdNET's combination of speed, CPU support, and the massive Xeno-canto training corpus makes it the default. It's why phone apps and most long-term passive acoustic monitoring deployments run it.
Within-species detail → Perch 2.0
Once you have crow calls isolated, Perch 2.0's embedding geometry captures graded variation BirdNET smooths over. This is what a production version of CrowLingo's similarity search (“find calls like this one”) would run on. Open weights, ONNX port for edge.
Plain-English Q&A → NatureLM-audio
When the question is “what species is this” or “how many individuals are calling” or “describe this clip,” you don't want an embedding, you want an answer. NatureLM-audio gives you one zero-shot. Heavy compute, niche task, but irreplaceable when you need it.
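The "find calls like this one" search described for Perch 2.0 comes down to ranking stored embeddings by their similarity to a query embedding. Here is a minimal sketch using cosine similarity, assuming each clip has already been reduced to a fixed-size vector. The clip ids, the toy 3-dim vectors, and the `most_similar` helper are hypothetical and do not describe CrowLingo's implementation.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def most_similar(query, library, k=3):
    """Return the k (clip_id, score) pairs closest to `query`, best first."""
    scored = [(clip_id, cosine_similarity(query, vec)) for clip_id, vec in library.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Toy 3-dim vectors standing in for 1,536-dim embeddings (hypothetical clips).
library = {
    "caw_a": [1.0, 0.0, 0.0],
    "caw_b": [0.9, 0.1, 0.0],
    "rattle": [0.0, 1.0, 0.0],
}
query = [1.0, 0.05, 0.0]

top = most_similar(query, library, k=2)
print([clip_id for clip_id, _ in top])  # ['caw_a', 'caw_b']
```

A brute-force scan like this is fine for a few thousand clips. A larger library would usually normalize the vectors once and use an approximate nearest-neighbor index.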

Frequently asked

What people ask about this.

What is the difference between BirdNET and Perch 2.0?
BirdNET is a Cornell-built EfficientNet-B0 species classifier optimized for real-time detection; it emits a 1,024-dim embedding biased toward species-discriminative features. Perch 2.0 is a 2025 Google Research model with an EfficientNet-B3 backbone, ~12M parameters, and a (5, 3, 1,536) embedding that captures finer within-species variation.
Which bioacoustic model is best for crows?
Perch 2.0 is the strongest published model for crow work in 2026: its embedding geometry captures within-species detail better than BirdNET, and unlike NatureLM-audio it returns a fixed-size vector suitable for similarity search and UMAP projection.
Is BirdNET still relevant in 2026?
Yes. BirdNET remains the dominant deployment model for long-term acoustic monitoring, phone-based citizen science, and detection-first pipelines where Perch's compute overhead isn't justified.