The three models at a glance

BirdNET[1], released by Stefan Kahl and colleagues at the Cornell Lab of Ornithology in 2021 and currently at version 2.4 (2024), is the workhorse species detector. It runs on phones via the BirdNET app, on Raspberry Pi nodes for passive acoustic monitoring, and on cloud servers for batch analysis. It uses an EfficientNet-B0 backbone, produces 1,024-dim embeddings, and covers roughly 3,000 species.

Perch 2.0, released by Google Research in 2025, is the within-species detail model. It uses an EfficientNet-B3 backbone with about 12 million parameters and produces a (5, 3, 1,536) embedding that captures the graded variation BirdNET smooths over. Open weights are on Hugging Face, with an ONNX port for edge inference.

NatureLM-audio[2], presented by Earth Species Project at ICLR 2025 (Robinson, Miron, Hagiwara[3], Pietquin and colleagues), is the audio-language model. It pairs a BEATs audio encoder with a Llama-3.1-8B-Instruct language backbone, answers natural-language questions about audio clips zero-shot, and is state of the art on the BEANS-Zero benchmark.

When to reach for BirdNET

Real-time detection at scale on consumer or edge hardware. The BirdNET app on a phone. Long-term passive acoustic monitoring on a Raspberry Pi. Batch processing of hundreds of hours of audio with limited compute. BirdNET[1] is fast (sub-second inference on CPU), and its species coverage is broad enough that 'is there a bird in this recording, and probably what species' is the right framing for most use cases. The trade-off is granularity: its embeddings are biased toward species-discriminative features, so they smooth over the within-species variation that matters for individual ID, dialect work, and behavioral context classification. If your question is taxonomic, BirdNET wins. If your question is graded, look elsewhere.

When to reach for Perch 2.0

Within-species detail. Similarity search ('find calls like this one'). Individual identification. Behavioral cluster discovery. The 2025 release was specifically designed to address BirdNET[1]'s smoothing problem: its embedding geometry captures finer variation in pitch contour, harmonic emphasis, spectral grain, and temporal structure than any open-source predecessor. The (5, 3, 1,536) embedding shape means each clip gets three pooled vectors across five time windows, which preserves more temporal information than a single mean-pooled vector. For mapping the geometry of one species' repertoire (the CrowLingo problem), Perch is the published state of the art in early 2026. Trade-off: it's larger (about 12M parameters), needs a GPU for fast batch inference, and its species coverage is narrower than BirdNET's.
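To see what a single mean-pooled vector throws away, here is a toy sketch. It is not Perch's API: the clips are made-up nested lists shaped (5, 3, 4) rather than (5, 3, 1,536), the first axis is assumed to be time, and `mean_pool`, `flatten`, and `cosine` are hypothetical helper names. Two clips with the same content in reversed time order look identical after mean pooling but stay distinguishable when the full structure is kept.

```python
import math

def mean_pool(emb):
    """Collapse a (T, F, D) nested list into one D-dim vector."""
    vectors = [vec for frame in emb for vec in frame]
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

def flatten(emb):
    """Keep every value: (T, F, D) becomes one vector of length T*F*D."""
    return [x for frame in emb for vec in frame for x in vec]

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy clips (illustrative values only): 5 time windows x 3 vectors x 4 dims.
# "rising" gets stronger over time; "falling" is the same clip reversed.
rising = [[[float(t + 1), 1.0, 0.0, 0.0] for _ in range(3)] for t in range(5)]
falling = list(reversed(rising))

pooled_sim = cosine(mean_pool(rising), mean_pool(falling))
full_sim = cosine(flatten(rising), flatten(falling))

print(f"mean-pooled similarity: {pooled_sim:.3f}")  # time order is lost
print(f"full-shape similarity:  {full_sim:.3f}")    # time order is kept
```

The same `cosine` helper is all a basic 'find calls like this one' search needs: score every clip in a library against the query and sort by similarity.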

When to reach for NatureLM-audio

Natural-language Q&A about audio clips, zero-shot. 'What species is this?' 'How many individuals are calling?' 'What life stage is this animal?' 'Describe this clip in two sentences.' NatureLM-audio[2] answers these without task-specific fine-tuning. SOTA on BEANS[3]-Zero, the Hagiwara-et-al benchmark designed precisely to evaluate zero-shot bioacoustic question-answering. Trade-off: it's heavy. The Llama-3.1-8B language backbone needs an A10G or A100 GPU for reasonable inference latency. Output is a natural-language string, not a fixed-size vector, which means similarity search and projection don't directly apply. Use it for analytical questions about clips, not for embedding-geometry work.
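A back-of-envelope estimate shows why the 8B backbone points you at those GPUs. The sketch below counts only the language-model weights. It assumes "8B" means exactly 8 billion parameters and ignores the BEATs encoder, activations, and the KV cache, so real usage will be higher. The 24 GB figure is the A10G's memory size; `weight_memory_gb` is a hypothetical helper name.

```python
def weight_memory_gb(n_params, bytes_per_param):
    """Memory for model weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bytes_per_param / 1e9

N_PARAMS = 8e9   # assumption: "8B" taken as exactly 8 billion parameters
A10G_GB = 24     # memory on a single NVIDIA A10G

fp32_gb = weight_memory_gb(N_PARAMS, 4)  # 4 bytes per parameter
fp16_gb = weight_memory_gb(N_PARAMS, 2)  # 2 bytes per parameter

for name, gb in [("fp32", fp32_gb), ("fp16", fp16_gb)]:
    verdict = "fits" if gb < A10G_GB else "does not fit"
    print(f"{name}: {gb:.0f} GB of weights, {verdict} on one A10G")
```

In half precision the weights fit on an A10G with some headroom; in full precision they do not, which is where the larger A100 comes in.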

Combining them

The most productive 2026 bioacoustic pipelines use more than one of these models. A common pattern: BirdNET[1] for detection (finding the crow in the long recording), Perch 2.0 for embedding (mapping the geometry of the found calls), and NatureLM-audio[2] for question-answering at the cluster level ('describe what this cluster sounds like, in plain English, given these exemplar clips'). The cost of running all three is dominated by NatureLM-audio's GPU bill; the value depends on how rigorously you need each function. For a research project mapping one species' repertoire, the BirdNET + Perch pair is usually enough. For a public-facing product where users ask natural-language questions, NatureLM-audio earns its keep.
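The detect, embed, describe pattern can be sketched as plain function composition. Everything below is hypothetical scaffolding: `detect`, `embed`, and `describe` stand in for calls to BirdNET, Perch 2.0, and NatureLM-audio, whose real APIs are not shown here, and the `Detection` fields, the clustering step, and the 0.5 confidence threshold are illustrative assumptions.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence

@dataclass
class Detection:
    start_s: float
    end_s: float
    species: str
    confidence: float

def run_pipeline(
    recording: str,
    detect: Callable[[str], Sequence[Detection]],           # BirdNET stand-in
    embed: Callable[[str, Detection], List[float]],         # Perch 2.0 stand-in
    cluster: Callable[[List[List[float]]], List[int]],      # any clustering method
    describe: Callable[[str, List[Detection]], str],        # NatureLM-audio stand-in
    target_species: str,
    min_conf: float = 0.5,                                   # assumed threshold
) -> Dict[int, str]:
    """Detect the target species, embed each hit, cluster, describe each cluster."""
    hits = [d for d in detect(recording)
            if d.species == target_species and d.confidence >= min_conf]
    vectors = [embed(recording, d) for d in hits]
    labels = cluster(vectors)
    groups: Dict[int, List[Detection]] = {}
    for d, label in zip(hits, labels):
        groups.setdefault(label, []).append(d)
    # The expensive language model runs once per cluster, not once per call.
    return {label: describe(recording, members) for label, members in groups.items()}

# Fake stages so the sketch runs end to end; swap in real model calls.
def fake_detect(rec):
    return [Detection(0, 3, "American Crow", 0.9), Detection(3, 6, "Blue Jay", 0.8),
            Detection(6, 9, "American Crow", 0.3), Detection(9, 12, "American Crow", 0.7)]

def fake_embed(rec, d):
    return [d.start_s]

def fake_cluster(vectors):
    return [0 if v[0] < 5 else 1 for v in vectors]

def fake_describe(rec, members):
    return f"{len(members)} call(s)"

summary = run_pipeline("night.wav", fake_detect, fake_embed,
                       fake_cluster, fake_describe, "American Crow")
print(summary)
```

Passing the stages in as callables keeps the BirdNET + Perch pair usable on its own: drop `describe` for a research pipeline, or plug it in only where users actually ask questions.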

The honest comparison table

| | BirdNET[1] | Perch 2.0 | NatureLM-audio[2] |
|---|---|---|---|
| Year released | v2.4 in 2024 (original 2021) | 2025 | ICLR 2025 |
| Authors | Kahl / Cornell | Denton, Hamer, Wang / Google Research | Robinson, Miron, Hagiwara[3], Pietquin / Earth Species Project |
| Architecture | EfficientNet-B0 | EfficientNet-B3 (~12M params) | BEATs + Llama-3.1-8B-Instruct |
| Output | classification + 1,024-dim embedding | classification + (5, 3, 1,536) embedding | natural-language text |
| Strength on corvids | workhorse detection | best within-species detail | best for natural-language queries |
| Inference cost | CPU-friendly | GPU-friendly with ONNX edge port | heavy, needs an A10G or A100 |
| License | MIT open source | open weights on Hugging Face | open weights on Hugging Face |
| Best for | real-time monitoring | similarity search and clustering | plain-English Q&A |

What's coming next

The 2026-2027 horizon: tighter integration between the embedding models and the language models. Imagine a single model that produces both a fixed-size 1,024-dim vector (for similarity search and clustering) and a natural-language description (for Q&A). Earth Species Project and Google Research have both telegraphed that this is the direction. There's also the open question of whether the economics of self-hosting versus hosted APIs will shift. Currently all three of these models are open-weights and self-hostable, which is good for science but expensive for hobbyist deployment. A hosted-API ecosystem comparable to OpenAI's for language models would lower the barrier to entry but introduce new dependencies.