CrowLingo


Stage 05

Cluster & label — from points to an interpretable atlas.

A 1,024-dim point cloud is hard to look at. A labeled 2-D map isn't. The trick is doing the math in the high-dimensional space and the human-judgement work in the low one.

Project for the eye, cluster for the model

UMAP and PaCMAP collapse 1,024 dimensions to 2 for visualization. They preserve local neighborhoods reasonably well and global structure only approximately. Use them to see your dataset. Don't run distance-based downstream analysis on the 2-D projection — it's a summary, not a ground truth.
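A quick way to convince yourself the projection is a summary: count how many of each point's high-dimensional nearest neighbors survive into 2-D. A minimal sketch on synthetic data, using PCA as a dependency-light stand-in for the projector (the same check applies unchanged to a UMAP or PaCMAP output):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
emb = rng.normal(size=(500, 64))          # stand-in for the (n, 1024) embeddings
proj = PCA(n_components=2).fit_transform(emb)

def knn_sets(X, k=15):
    # indices of each point's k nearest neighbors (self excluded)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    idx = nn.kneighbors(X, return_distance=False)[:, 1:]
    return [set(row) for row in idx]

hi, lo = knn_sets(emb), knn_sets(proj)
overlap = np.mean([len(a & b) / len(a) for a, b in zip(hi, lo)])
print(f"mean neighborhood overlap: {overlap:.2%}")
```

On structureless noise like this the overlap is low; on real embeddings with cluster structure a good projection scores much higher, but never 100% — which is exactly why distances should be measured in the original space.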

HDBSCAN as default

HDBSCAN doesn't require a target cluster count, handles density variation, and labels low-density points as noise (−1). For crow embeddings that's usually right: dense clusters where the encoder finds repeatable structure, noise points where it doesn't.

Run HDBSCAN on the 1,024-dim space, not the 2-D projection. Tune min_cluster_size by the smallest behaviorally meaningful group you expect to detect — usually 10–30 for crow repertoires under 20k calls.

Labeling: humans on exemplars, not on every clip

For each cluster, audition 5–10 randomly sampled exemplars. Listen first, look at the spectrogram second. Name the cluster with a short noun phrase ("loud territorial caw", "rattle complex"), not a behavioral interpretation ("territorial"). Behavior belongs in the join at stage 6.
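Keep the names as data rather than renaming arrays in place; a dict from cluster id to noun phrase joins cleanly later. A minimal sketch (the ids and names here are hypothetical):

```python
import numpy as np
import pandas as pd

labels = np.array([0, 0, 1, 2, 2, -1])   # hypothetical HDBSCAN output
names = {
    0: "loud territorial caw",
    1: "rattle complex",
    2: "soft contact note",
}

atlas = pd.DataFrame({"cluster": labels})
# noise (-1) and any cluster you haven't named yet fall through to a default
atlas["name"] = atlas["cluster"].map(names).fillna("noise / unlabeled")
print(atlas["name"].value_counts())
```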

The atlas, refined

Iterate: cluster → audition → rename → merge or split → re-cluster if needed. A handful of human-named clusters covering 80% of your points is the realistic goal. The remaining 20% is graded variation, dialect, or genuine noise — keep it visible on the map, don't prune it.
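The merge step of that loop doesn't require re-running the clusterer: re-map the label array, then audit again. A sketch with hypothetical ids, merging cluster 3 into cluster 1 after deciding by ear that they're the same call type:

```python
import numpy as np

labels = np.array([0, 1, 1, 3, 3, 2, -1])   # hypothetical labels after auditioning
merge = {3: 1}                              # 3 sounds like 1; fold it in

merged = np.array([merge.get(l, l) for l in labels])
print(merged)   # cluster 3 absorbed into cluster 1; noise stays -1
```

Splitting is the opposite case: re-cluster only the members of the offending cluster, usually with a smaller min_cluster_size.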

Project + cluster, end-to-end
import umap, hdbscan, numpy as np

emb = np.load("embeddings.npy")   # shape (n, 1024)

# 2-D projection (for visualization)
proj = umap.UMAP(
    n_neighbors=15, min_dist=0.0, metric="cosine", random_state=42
).fit_transform(emb)

# clustering on the full space (for downstream math)
labels = hdbscan.HDBSCAN(
    min_cluster_size=20,
    metric="euclidean",
    cluster_selection_method="leaf",
).fit_predict(emb)

# audit: cluster sizes + noise share
import pandas as pd
counts = pd.Series(labels).value_counts().sort_index()
print(counts)
print(f"noise share: {(labels == -1).mean():.2%}")

Audition exemplars per cluster

import random
from IPython.display import Audio, display

# paths: list of clip file paths, aligned row-for-row with emb and labels

for cid in sorted(set(labels) - {-1}):
    members = np.where(labels == cid)[0]
    sample = random.sample(list(members), min(5, len(members)))
    print(f"\ncluster {cid:02d} — {len(members)} clips")
    for idx in sample:
        display(Audio(paths[idx]))
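To get the labeled atlas out of the notebook, join the 2-D coordinates, cluster ids, and names into one table and save it. A minimal sketch with synthetic stand-ins for the proj, labels, and paths arrays built above (file names here are hypothetical):

```python
import numpy as np
import pandas as pd

# stand-ins for the arrays built earlier in the stage
n = 8
proj = np.random.default_rng(0).normal(size=(n, 2))   # 2-D UMAP coordinates
labels = np.array([0, 0, 1, 1, 1, 2, -1, -1])         # HDBSCAN output
paths = [f"clips/{i:04d}.wav" for i in range(n)]      # hypothetical clip paths
names = {0: "loud territorial caw", 1: "rattle complex", 2: "soft contact note"}

atlas = pd.DataFrame({
    "path": paths,
    "x": proj[:, 0],
    "y": proj[:, 1],
    "cluster": labels,
})
atlas["name"] = atlas["cluster"].map(names).fillna("noise")
atlas.to_csv("atlas.csv", index=False)
print(atlas.head())
```

One row per clip, one file per atlas iteration: this is the artifact stage 6 joins against behavior.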