Stage 01
Record — capture the call without breaking the scene.
Field-recording discipline pays compound interest. Every model downstream sees the audio you captured; preprocessing only subtracts.
The minimum viable rig
A directional shotgun microphone, a portable recorder, and a phone running a timestamped behavior log. Anything beyond that is optimization. Anything less and you start losing data you can't recover.
- Mic: Sennheiser ME66/MKE600-class shotgun or a Røde NTG-2. Foam + furry windscreen. Off-axis rejection matters more than absolute sensitivity.
- Recorder: Zoom H1n or H5 at 48 kHz / 24-bit, mono, lossless WAV. Phone-only is acceptable but a real recorder gives you preamp headroom and accurate clock.
- Behavior log: phone app that timestamps events. Keep it boring: who, where, what behavior, who-with.
- Sync: at session start, clap once near the mic with the phone visible. You'll align audio and log at stage 6 by that clap.
Sample rate, bit depth, channels
48 kHz / 24-bit, mono. Crow energy lives 200 Hz – 8 kHz; 48 kHz captures everything with comfortable headroom for any downstream filter. 24-bit preserves quiet calls without quantization noise eating the spectral grain. Mono is the right choice — stereo doubles your file size for no per-clip benefit, and most models embed mono anyway.
The behavior log schema
Keep it append-only and structured. CSV or JSONL. Each row is a timestamped observation:
time crow_id behavior partner notes
13:42:01.4 AC-04 territorial-caw -- from cottonwood, faces east
13:42:08.2 AC-04 mobbing-attempt AC-04+ pair joined, target redtail
13:42:36.7 AC-07 food-discovery -- new bird, lower branch
13:43:02.0 AC-04 companion-call AC-04 contact w/ mate
13:43:18.9 -- rain-onset -- recording cut shortUse stable crow IDs (banded if possible, descriptor like "left-eye-mark" otherwise). Resist the urge to interpret in the log — "long-caw, head-low, tail-flicked" is data; "agitated" is interpretation that belongs in analysis.
Synchronization in practice
The classic clap-sync works because it puts an audible transient on the audio track and a timestamped event in the log at the same instant. At stage 6 you'll subtract the clap time from the recorder timestamp to compute the offset; everything before the clap is unusable, everything after is aligned to ±50 ms.
For longer sessions, clap every 20 minutes. Recorder clocks drift; logs drift too. Multiple clap-sync points let you correct for both.

# 48 kHz / 24-bit mono with location + recordist metadata
ffmpeg -f avfoundation -i ":1" -ar 48000 -ac 1 -c:a pcm_s24le \
-metadata title="Crow session" \
-metadata location="Discovery Park, Seattle (city-coarsened)" \
-metadata recordist="J. Field" \
-metadata coords="47.66/-122.41 (coarse)" \
capture_$(date +%Y%m%d_%H%M%S).wav