Stage 04
Embed — encoder choice is the third axis of every analysis.
The 1,024 numbers a model produces for your clip aren't a property of the clip. They're a property of the clip and the encoder. Treat both as variables.
This is the part that actually changed in 2023. Older bioacoustic pipelines hand-crafted features (mean frequency, peak amplitude, formant structure) and called that an acoustic profile. Foundation models don't. Perch 2.0 and NatureLM-audio take the raw waveform, run it through a deep network pretrained on millions of wildlife recordings, and emit a fixed-length vector of 1,024 or 1,536 dimensions, depending on the encoder. The geometry of that vector space carries far more information than any hand-engineered profile ever did. Two crow calls that sit close together in the embedding space sound similar; calls in distant regions sound different in ways even human listeners might struggle to articulate. That geometry is the substrate for everything that follows.
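To make "close together in the embedding" concrete, here is a minimal sketch of a cosine-similarity nearest-neighbor lookup. It uses random NumPy vectors as stand-ins for real encoder output; the function name `nearest_neighbors`, the corpus size, and the 1,024-dimension width are illustrative assumptions, not part of any encoder's API.

```python
import numpy as np

def nearest_neighbors(query, corpus, k=3):
    """Return indices and cosine similarities of the k corpus rows closest to query."""
    q = query / np.linalg.norm(query)
    c = corpus / np.linalg.norm(corpus, axis=1, keepdims=True)
    sims = c @ q                      # cosine similarity of every row to the query
    top = np.argsort(-sims)[:k]       # highest similarity first
    return top, sims[top]

rng = np.random.default_rng(0)
corpus = rng.normal(size=(100, 1024))              # stand-in for 100 clip embeddings
query = corpus[42] + 0.1 * rng.normal(size=1024)   # a slightly perturbed copy of clip 42

idx, sims = nearest_neighbors(query, corpus)
```

With real embeddings, `corpus` would hold one row per clip from a single encoder, and the returned indices point at the clips that encoder considers most similar to the query.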
The four encoders worth knowing
For crow audio specifically, four pretrained encoders cover the interesting cases:
- BirdNET embeddings: fast, mature, biased toward species-discriminative features. Excellent for "is this a crow at all," less ideal for fine-grained within-crow structure. Docs: birdnet.cornell.edu
- Perch 2.0 (Google, EfficientNet-B3 lineage): broader audio coverage, finer within-species detail, ONNX port available. The default for graded-call work in 2024–26.
- CLAP: joint audio-text. Useful if you want to query a corpus with natural-language prompts (“low territorial caw with rasp”) but expect rougher within-call geometry than Perch.
- NatureLM-audio: Earth Species Project's audio-language foundation model for bioacoustics, ICLR 2025. SOTA on BEANS-Zero; supports zero-shot captioning. Heavier compute. Weights: EarthSpeciesProject/NatureLM-audio
What it means to “live in a different space”
Embeddings from different encoders are not comparable. The cosine distance between two BirdNET vectors is meaningful; between a BirdNET vector and a NatureLM-audio vector, it is noise. Even fine-tuning the same base model on different downstream tasks shifts the geometry.
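One way to keep cross-encoder comparisons from happening by accident is to carry the encoder identity alongside each vector and refuse to compute a distance when the identities differ. The sketch below is an illustration only: `TaggedEmbedding` and `cosine_distance` are hypothetical names, and the hash strings are placeholders.

```python
from dataclasses import dataclass
import numpy as np

@dataclass(frozen=True, eq=False)
class TaggedEmbedding:
    vector: np.ndarray
    encoder: str          # e.g. "birdnet" or "naturelm-audio"
    weights_sha256: str   # identifies the exact checkpoint

def cosine_distance(a, b):
    """Cosine distance, defined only for vectors from the same encoder and weights."""
    if (a.encoder, a.weights_sha256) != (b.encoder, b.weights_sha256):
        raise ValueError(
            f"cannot compare embeddings from {a.encoder!r} and {b.encoder!r}: "
            "different encoder or weights"
        )
    va, vb = a.vector, b.vector
    return 1.0 - float(va @ vb / (np.linalg.norm(va) * np.linalg.norm(vb)))

# Toy 2-D vectors with placeholder hashes
a = TaggedEmbedding(np.array([1.0, 0.0]), "birdnet", "aaa")
b = TaggedEmbedding(np.array([0.0, 1.0]), "birdnet", "aaa")
c = TaggedEmbedding(np.array([1.0, 0.0]), "naturelm-audio", "bbb")

d_same_encoder = cosine_distance(a, b)   # allowed: same encoder, same weights

try:
    cosine_distance(a, c)                # rejected: different encoders
except ValueError as err:
    mix_error = str(err)
```

Including the weights hash in the check also catches the fine-tuning case: two checkpoints of the same base model are treated as different spaces.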
The practical rule: pick one encoder per project, document its version and weights hash, and embed everything with it. If you must mix encoders (for example, to backfill old recordings), keep their embeddings in parallel namespaces and cross-validate on a small set of clips embedded with both.
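The text does not say how to score the double-embedded clips, so the following is one possible check rather than a prescribed method: for each clip, compare its k nearest neighbors under encoder A with its k nearest neighbors under encoder B, and report the mean overlap. The function names and the synthetic data are assumptions for illustration; the vectors are never compared across spaces, only the neighbor sets are.

```python
import numpy as np

def knn_indices(emb, k):
    """Indices of each row's k nearest neighbors by cosine similarity, excluding itself."""
    unit = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    sims = unit @ unit.T
    np.fill_diagonal(sims, -np.inf)          # a clip is not its own neighbor
    return np.argsort(-sims, axis=1)[:, :k]

def neighbor_overlap(emb_a, emb_b, k=5):
    """Mean fraction of k-nearest neighbors shared by two embeddings of the same clips."""
    nn_a, nn_b = knn_indices(emb_a, k), knn_indices(emb_b, k)
    shared = [len(set(a) & set(b)) / k for a, b in zip(nn_a, nn_b)]
    return float(np.mean(shared))

rng = np.random.default_rng(0)
emb_a = rng.normal(size=(50, 64))            # 50 clips under a stand-in "encoder A"

# Stand-in "encoder B": same geometry mapped into 96 dimensions by a matrix
# with orthonormal columns, so pairwise cosine similarities are preserved.
q, _ = np.linalg.qr(rng.normal(size=(96, 64)))
emb_b = emb_a @ q.T

emb_unrelated = rng.normal(size=(50, 96))    # an encoder that agrees on nothing

agree = neighbor_overlap(emb_a, emb_b)
disagree = neighbor_overlap(emb_a, emb_unrelated)
```

High overlap suggests the two encoders organize the shared clips similarly; overlap near k / (n - 1) is what unrelated spaces would give by chance.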
Pretrained representations are the new ground truth in bioacoustics. Their biases are now everyone's biases.
from naturelm_audio import NatureLM

model = NatureLM.from_pretrained("EarthSpeciesProject/NatureLM-audio")

# single clip
emb = model.embed("./crow_clip_001.wav")  # shape (1024,)

# batch
paths = ["./clip_001.wav", "./clip_002.wav", "./clip_003.wav"]
embeddings = model.embed_many(paths)  # shape (3, 1024)

import hashlib
import pandas as pd

# one row per clip; the scalar provenance fields repeat on every row
df = pd.DataFrame({
    "path": paths,
    "embedding": [e.tolist() for e in embeddings],
    "encoder": "naturelm-audio",
    "encoder_version": "v0.3.1",
    "weights_sha256": hashlib.sha256(model.weights_bytes).hexdigest(),
    "sample_rate": 48_000,
    "timestamp_utc": pd.Timestamp.now(tz="UTC").isoformat(),
})
df.to_parquet("embeddings.parquet")
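If the model object does not expose its weights as bytes, the same provenance hash can be computed from the checkpoint file on disk. This is a minimal standard-library sketch; `sha256_of_file` is a hypothetical helper, and the temporary file stands in for a real checkpoint whose path depends on how the weights were downloaded.

```python
import hashlib
import os
import tempfile

def sha256_of_file(path, chunk_size=1 << 20):
    """Hash a file in chunks so large checkpoints never need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Stand-in for a checkpoint file (illustration only)
payload = b"fake checkpoint bytes" * 1000
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(payload)
    checkpoint_path = tmp.name

weights_sha256 = sha256_of_file(checkpoint_path, chunk_size=4096)
os.remove(checkpoint_path)
```

The resulting hex string is what would go in the `weights_sha256` column, so any later change to the checkpoint shows up as a different value in the table.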