CrowLingo

PILLAR II — THE METHODS

Self-supervised audio, from scratch.

The models that make CrowLingo possible: BEATs, Perch, NatureLM-audio. They never saw a label. They learned acoustic structure from millions of unlabeled wildlife recordings, then we asked them about crows.

dimensions: 1024
UMAP: 2D
density: HDBSCAN
no labels: SSL

See the full pipeline.

Eight stages, from field recording to AI captioning.

Open the pipeline →

Frequently asked

What people ask about this.

What's the difference between traditional bioacoustics and ALP?
Traditional bioacoustics relies on features chosen by humans (mean frequency, harmonic-to-noise ratio, etc.) and trains classifiers on those features. Animal Language Processing (ALP) trains audio foundation models on millions of unlabeled recordings via self-supervised objectives, then uses the model's learned embeddings directly. The trade-off: learned embeddings retain more of the raw signal, but they are harder to interpret than a handful of named features.
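To make the "hand-crafted feature" side concrete, here is a minimal sketch of one such feature, mean frequency, computed as the power-weighted average of the spectrum. The function name `mean_frequency` and the synthetic 1500 Hz tone are illustrative assumptions, not part of CrowLingo's pipeline.

```python
import numpy as np


def mean_frequency(signal: np.ndarray, sample_rate: int) -> float:
    """Power-weighted mean frequency (spectral centroid) of a mono clip, in Hz."""
    power = np.abs(np.fft.rfft(signal)) ** 2
    freqs = np.fft.rfftfreq(signal.size, d=1.0 / sample_rate)
    return float(np.sum(freqs * power) / np.sum(power))


# Synthetic stand-in for a field recording: one second of a pure 1500 Hz tone.
sample_rate = 16_000
t = np.arange(sample_rate) / sample_rate
clip = np.sin(2 * np.pi * 1500.0 * t)

centroid_hz = mean_frequency(clip, sample_rate)
print(f"mean frequency: {centroid_hz:.1f} Hz")
```

A feature like this is easy to read (one number, in Hz) but discards almost everything else in the clip. A learned embedding makes the opposite choice: a long vector with no named axes.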
What is an audio foundation model?
An audio foundation model is a transformer trained on large unlabeled audio corpora with self-supervised objectives such as masked prediction. Given any input clip, it produces a learned vector representation (an embedding) that captures more information than hand-engineered features. Examples: BEATs, Perch 2.0, NatureLM-audio.
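As a rough illustration of what "masked prediction" means, the sketch below hides a fraction of spectrogram patches and scores a prediction only on the hidden positions. Everything here is an assumption for illustration: the patch shape, the 40% mask ratio, the helper names `mask_patches` and `masked_mse`, and the trivial stand-in "model". It shows a reconstruction-style objective in general, not the specific training objective of any model named above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical input: a spectrogram cut into 50 patches, each flattened to 64 values.
patches = rng.normal(size=(50, 64))


def mask_patches(patches: np.ndarray, mask_ratio: float, rng: np.random.Generator):
    """Zero out a random subset of patches; return the corrupted copy and the boolean mask."""
    n = patches.shape[0]
    n_masked = int(round(mask_ratio * n))
    mask = np.zeros(n, dtype=bool)
    mask[rng.choice(n, size=n_masked, replace=False)] = True
    corrupted = patches.copy()
    corrupted[mask] = 0.0
    return corrupted, mask


def masked_mse(predicted: np.ndarray, target: np.ndarray, mask: np.ndarray) -> float:
    """Mean squared error computed over the masked patches only."""
    return float(np.mean((predicted[mask] - target[mask]) ** 2))


corrupted, mask = mask_patches(patches, mask_ratio=0.4, rng=rng)

# Stand-in "model": predicts the mean of the visible patches at every position.
# A real model would be a transformer that reads `corrupted` and fills in the gaps.
visible_mean = corrupted[~mask].mean(axis=0)
predicted = np.tile(visible_mean, (patches.shape[0], 1))

loss = masked_mse(predicted, patches, mask)
print(f"masked patches: {mask.sum()}, loss on masked positions: {loss:.3f}")
```

No labels appear anywhere: the training signal comes from the audio itself, which is why this kind of objective scales to unlabeled corpora.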