Stage 03
Preprocess — clean what you got, conservatively.
Self-supervised models were trained on dirty audio. Heavy denoising removes the grain they learned to read. Light, reversible, transparent — the rule.
AI narration · Stage 3 — Preprocess
Once a clip is isolated, it needs to look like the training distribution of the model that will embed it. That means a bandpass filter to drop traffic and HVAC rumble below 150 hertz, a mild high-shelf cut above 10 kilohertz to suppress hiss, peak normalization to a fixed peak level so a loud caw and a quiet rattle land at comparable amplitudes, and chunking to the model's expected window length: five seconds for Perch, three for BirdNET, ten for some BEANS evaluations. None of this is sophisticated. All of it determines whether your downstream embedding clusters are real or artifacts of preprocessing variance.
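The chunking step can be sketched as follows. This is a minimal illustration, not any encoder's own windowing code: `chunk_clip` is a hypothetical helper, and zero-padding the final partial window is an assumption (some pipelines drop the tail or overlap windows instead).

```python
import numpy as np

def chunk_clip(y, sr, window_s, pad_last=True):
    """Split a mono signal into non-overlapping windows of window_s seconds."""
    win = int(round(window_s * sr))
    if win <= 0:
        raise ValueError("window_s * sr must be at least one sample")
    n_full = len(y) // win
    chunks = [y[i * win:(i + 1) * win] for i in range(n_full)]
    tail = y[n_full * win:]
    if len(tail) > 0 and pad_last:
        # Assumption: zero-pad the final partial window up to full length
        chunks.append(np.pad(tail, (0, win - len(tail))))
    return chunks

sr = 48_000
# 12 s of noise as a stand-in for a real clip
y = np.random.default_rng(0).standard_normal(12 * sr).astype(np.float32)
perch_chunks = chunk_clip(y, sr, window_s=5.0)    # 5 s windows
birdnet_chunks = chunk_clip(y, sr, window_s=3.0)  # 3 s windows
```

With a 12-second clip, the 5-second windowing yields two full windows plus one padded tail, while the 3-second windowing divides evenly.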
The minimum recipe
- Convert to mono if not already.
- Bandpass 200 Hz to 8 kHz to drop low rumble and high-frequency hiss.
- Peak-normalize to about −3 dBFS (don't loudness-normalize).
- If the noise floor is audibly present, apply a light non-stationary denoise.
- Save the original. Save the processed clip. Save the processing parameters.
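One way to keep the processing parameters next to the processed clip is a small JSON sidecar. The sketch below is an assumption about layout, not a required format: `write_sidecar` and the key names in `PARAMS` are hypothetical, and the values are the ones used in this stage's recipe.

```python
import hashlib
import json
from pathlib import Path

def write_sidecar(original, processed, params):
    """Write <processed>.json recording the source file, its hash, and the parameters used."""
    original, processed = Path(original), Path(processed)
    record = {
        "original": original.name,
        # The hash ties the processed clip to one exact source file
        "original_sha256": hashlib.sha256(original.read_bytes()).hexdigest(),
        "processed": processed.name,
        "params": params,
    }
    sidecar = processed.with_suffix(".json")
    sidecar.write_text(json.dumps(record, indent=2))
    return sidecar

# Hypothetical parameter names; values follow this stage's recipe
PARAMS = {
    "mono": True,
    "bandpass_hz": [200, 8000],
    "peak_dbfs": -3.0,
    "denoise": {"stationary": False, "prop_decrease": 0.6},
}
```

Calling `write_sidecar("raw/clip.wav", "clean/clip.wav", PARAMS)` would write `clean/clip.json`; the paths here are placeholders.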
Why peak, not loudness
Loudness normalization rescales each clip toward a target LUFS, which means a quiet companion call and a loud territorial caw arrive at the model at the same perceived level. Context-relative loudness encodes urgency and proximity — both signal-bearing. Peak-normalizing keeps the relationship intact while making sure no clip is too quiet to embed cleanly.
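To make the difference concrete, the sketch below applies both kinds of normalization to two synthetic clips. Everything here is illustrative: the clips are sine tones, not recordings, and plain RMS is used as a crude stand-in for LUFS (a real LUFS measurement adds frequency weighting and gating). The −23 dB RMS target is an arbitrary assumption.

```python
import numpy as np

def peak_normalize(y, peak_dbfs=-3.0):
    """Scale y by one linear gain so its largest absolute sample sits at peak_dbfs."""
    peak = np.max(np.abs(y))
    if peak == 0:
        return y  # silent clip: nothing to scale
    return y * (10 ** (peak_dbfs / 20) / peak)

def rms_normalize(y, rms_dbfs=-23.0):
    """Crude stand-in for loudness normalization: force the clip's RMS to rms_dbfs."""
    rms = np.sqrt(np.mean(y ** 2))
    if rms == 0:
        return y
    return y * (10 ** (rms_dbfs / 20) / rms)

def rms_db(y):
    """RMS level in dB relative to full scale."""
    return 20 * np.log10(np.sqrt(np.mean(y ** 2)))

sr = 48_000
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 1_000 * t)
caw = 0.8 * tone                                             # loud for the whole second
rattle = 0.8 * tone * (t < 0.05) + 0.05 * tone * (t >= 0.05)  # brief burst, then quiet

for name, clip in [("caw", caw), ("rattle", rattle)]:
    print(name,
          "peak-normalized RMS:", round(rms_db(peak_normalize(clip)), 1),
          "RMS-normalized RMS:", round(rms_db(rms_normalize(clip)), 1))
```

After peak normalization both clips share the same ceiling but keep clearly different average levels. After RMS normalization that difference is gone, which is the information loss described above.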
Denoise carefully or not at all
The noisereduce library's stationary mode assumes a uniform noise profile and will eat broadband transients like wing-rustle that may matter for context. Non-stationary mode with a low prop_decrease (around 0.6) is safer. If you can hear the call clearly without denoising, ship the call without denoising.
Resampling decisions
Most encoders want 16 or 32 kHz. Resample at embedding time, not at preprocessing time — keep the 48 kHz source and let the encoder do its own resample with its preferred filter. Otherwise your data has a resample fingerprint baked in.
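A minimal sketch of resampling at embedding time, assuming SciPy is installed. `resample_for_encoder` and `embed` are hypothetical helpers, and `embed_fn` stands in for whatever encoder call you use. If your encoder wrapper already resamples internally, hand it the 48 kHz audio and skip this step.

```python
from math import gcd

import numpy as np
from scipy.signal import resample_poly

def resample_for_encoder(y, sr_in, sr_out):
    """Polyphase resample from sr_in to sr_out; the stored file stays at sr_in."""
    if sr_in == sr_out:
        return y
    g = gcd(sr_in, sr_out)
    # up/down factors reduced to lowest terms, e.g. 48 kHz -> 32 kHz is 2/3
    return resample_poly(y, sr_out // g, sr_in // g)

def embed(y_48k, embed_fn, encoder_sr):
    # Resample only the in-memory copy handed to the encoder
    return embed_fn(resample_for_encoder(y_48k, 48_000, encoder_sr))

y = np.zeros(48_000 * 5, dtype=np.float32)      # 5 s placeholder clip at 48 kHz
y32 = resample_for_encoder(y, 48_000, 32_000)   # 160,000 samples
y16 = resample_for_encoder(y, 48_000, 16_000)   # 80,000 samples
```

The array on disk never changes rate, so switching encoders later means changing `encoder_sr`, not reprocessing the archive.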
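import librosa
import numpy as np
import noisereduce as nr
import soundfile as sf

def preprocess(path, out, denoise=True):
    # Load as mono at 48 kHz (librosa resamples if the file uses another rate)
    y, sr = librosa.load(path, sr=48_000, mono=True)
    # Bandpass 200 Hz to 8 kHz via a brick-wall FFT mask
    Y = np.fft.rfft(y)
    freqs = np.fft.rfftfreq(len(y), 1 / sr)
    Y[(freqs < 200) | (freqs > 8_000)] = 0
    # Pass n so odd-length clips come back at their original length
    y = np.fft.irfft(Y, n=len(y))
    # Light non-stationary denoise; pass denoise=False when the call is already clear
    if denoise:
        y = nr.reduce_noise(y=y, sr=sr, stationary=False, prop_decrease=0.6)
    # Peak-normalize to -3 dBFS: 10 ** (-3 / 20) is about 0.708
    y = librosa.util.normalize(y) * 10 ** (-3 / 20)
    # librosa.output.write_wav was removed in librosa 0.8; write with soundfile
    sf.write(out, y, sr)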