Stage 05
Cluster & label — from points to an interpretable atlas.
A 1,024-dim point cloud is hard to look at. A labeled 2-D map isn't. The trick is doing the math in the high-dimensional space and the human-judgement work in the low-dimensional one.
Project for the eye, cluster for the model
UMAP and PaCMAP collapse 1,024 dimensions to 2 for visualization. They preserve local neighborhoods reasonably well and global structure only approximately. Use them to see your dataset. Don't run distance-based downstream analysis on the 2-D projection; it's a summary, not ground truth.
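The worked script at the end of this stage uses UMAP; if you prefer PaCMAP, which aims to balance local and global structure, a minimal sketch (assuming the pacmap package is installed and the same embeddings.npy file) looks like this:

import numpy as np
import pacmap

emb = np.load("embeddings.npy")  # shape (n, 1024), the same array the UMAP script loads
# n_neighbors=None lets PaCMAP pick a neighborhood size from the dataset size
proj = pacmap.PaCMAP(n_components=2, n_neighbors=None).fit_transform(emb, init="pca")

Either way, the projection is still only for the eye.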
HDBSCAN as default
HDBSCAN doesn't require a target cluster count, handles density variation, and labels low-density points as noise (−1). For crow embeddings that's usually right: dense clusters where the encoder finds repeatable structure, noise points where it doesn't.
Run HDBSCAN on the 1,024-dim space, not the 2-D projection. Tune min_cluster_size by the smallest behaviorally meaningful group you expect to detect — usually 10–30 for crow repertoires under 20k calls.
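One way to ground that choice is to sweep a few values and watch how the cluster count and noise share respond; a minimal sketch, assuming the same embeddings.npy file as the script below:

import hdbscan
import numpy as np

emb = np.load("embeddings.npy")  # shape (n, 1024)
for mcs in (10, 20, 30, 50):
    labels = hdbscan.HDBSCAN(
        min_cluster_size=mcs, metric="euclidean", cluster_selection_method="leaf"
    ).fit_predict(emb)
    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
    print(f"min_cluster_size={mcs:3d}: {n_clusters:3d} clusters, "
          f"{(labels == -1).mean():.1%} noise")

There is no single right value; pick the smallest one whose clusters still audition as coherent call types.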
Labeling: humans on exemplars, not on every clip
For each cluster, audition 5–10 randomly sampled exemplars. Listen first, look at the spectrogram second. Name the cluster with a short noun phrase ("loud territorial caw", "rattle complex"), not a behavior ("territorial"). Behavior belongs in the join at stage 6.
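The audition loop in the script below plays audio only; to look at a spectrogram after listening, a minimal sketch for a single exemplar (librosa and matplotlib assumed installed; "clip.wav" is a placeholder path):

import librosa
import librosa.display
import matplotlib.pyplot as plt
import numpy as np
from IPython.display import Audio, display

path = "clip.wav"  # placeholder: one exemplar sampled from a cluster
display(Audio(path))                      # listen first
y, sr = librosa.load(path, sr=None)
S_db = librosa.amplitude_to_db(np.abs(librosa.stft(y)), ref=np.max)
librosa.display.specshow(S_db, sr=sr, x_axis="time", y_axis="hz")
plt.title("exemplar spectrogram")         # look second
plt.show()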
The atlas, refined
Iterate: cluster → audition → rename → merge or split → re-cluster if needed. A handful of human-named clusters covering 80% of your points is the realistic goal. The remaining 20% is graded variation, dialect, or genuine noise — keep it visible on the map, don't prune it.
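One way to keep the merge and rename decisions reproducible is to record them as plain dictionaries and apply them to the label array instead of editing labels by hand; a minimal sketch, where the cluster ids, names, and stand-in labels array are all hypothetical and would come from your own audition pass over the HDBSCAN output of the script below:

import numpy as np
import pandas as pd

labels = np.array([0, 0, 1, 2, 3, 3, -1])   # stand-in; use the real HDBSCAN labels
name_map = {0: "loud territorial caw", 1: "rattle complex", 2: "soft contact caw"}
merge_map = {3: 0}                            # cluster 3 auditioned as the same type as 0

merged = pd.Series(labels).replace(merge_map)
atlas = pd.DataFrame({
    "cluster": merged,
    "name": merged.map(name_map).fillna("noise / unlabeled"),
})
atlas.to_csv("atlas_labels.csv", index=False)

The full worked script for the stage follows.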
import umap, hdbscan, numpy as np
emb = np.load("embeddings.npy") # shape (n, 1024)
# 2-D projection (for visualization)
proj = umap.UMAP(
    n_neighbors=15, min_dist=0.0, metric="cosine", random_state=42
).fit_transform(emb)
# clustering on the full space (for downstream math)
labels = hdbscan.HDBSCAN(
    min_cluster_size=20,
    metric="euclidean",
    cluster_selection_method="leaf",
).fit_predict(emb)
# audit: cluster sizes + noise share
import pandas as pd
counts = pd.Series(labels).value_counts().sort_index()
print(counts)
print(f"noise share: {(labels == -1).mean():.2%}")import random
from IPython.display import Audio, display
# paths: one audio file path per embedding row ("paths.npy" is a placeholder for
# however the earlier stages saved the clip list)
paths = np.load("paths.npy", allow_pickle=True)

for cid in sorted(set(labels) - {-1}):
    members = np.where(labels == cid)[0]
    sample = random.sample(list(members), min(5, len(members)))
    print(f"\ncluster {cid:02d} — {len(members)} clips")
    for idx in sample:
        display(Audio(paths[idx]))
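Finally, the labeled map itself: the 2-D projection colored by cluster, with noise kept visible in grey. A minimal sketch, assuming matplotlib is installed and proj and labels from the script above are still in memory:

import matplotlib.pyplot as plt

noise = labels == -1
plt.figure(figsize=(7, 6))
plt.scatter(proj[noise, 0], proj[noise, 1], s=4, c="lightgrey", label="noise")
plt.scatter(proj[~noise, 0], proj[~noise, 1], s=4, c=labels[~noise], cmap="tab20")
plt.legend(loc="lower right")
plt.axis("off")
plt.title("crow call atlas: UMAP projection, HDBSCAN clusters")
plt.savefig("atlas.png", dpi=200)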