Stage 01
Record — capture the call without breaking the scene.
Field-recording discipline pays compound interest. Every model downstream sees the audio you captured; preprocessing only subtracts.
AI narration · Stage 1 — Capture
Every vocal map starts with a microphone in front of a crow. Most public corpora — Wikimedia, Xeno-canto, the Macaulay Library — were built with handheld parabolic mics aimed at calling birds, recorded at 44 or 48 kilohertz. That's enough bandwidth to cover the full crow vocal range, from the 200-hertz bottom of the loudest caws to the 8-kilohertz top of juvenile begging. The harder problem isn't sample rate; it's behavioral context. A recording without notes — where, when, what the bird was doing, whether it was alone or with kin — is acoustically usable but semantically thin. That's why the wearable-logger studies of the 2020s, Demartsev's carrion crow work especially, are quietly revolutionary.
The minimum viable rig
A directional shotgun microphone, a portable recorder, and a phone running a timestamped behavior log. Anything beyond that is optimization. Anything less and you start losing data you can't recover.
- Mic: Sennheiser ME66/MKE600-class shotgun or a Røde NTG-2. Foam + furry windscreen. Off-axis rejection matters more than absolute sensitivity.
- Recorder: Zoom H1n or H5 at 48 kHz / 24-bit, mono, lossless WAV. Phone-only is acceptable but a real recorder gives you preamp headroom and accurate clock.
- Behavior log: phone app that timestamps events. Keep it boring: who, where, what behavior, who-with.
- Sync:at session start, clap once near the mic with the phone visible. You'll align audio and log at stage 6 by that clap.
Sample rate, bit depth, channels
48 kHz / 24-bit, mono. Crow energy lives 200 Hz – 8 kHz; 48 kHz captures everything with comfortable headroom for any downstream filter. 24-bit preserves quiet calls without quantization noise eating the spectral grain. Mono is the right choice — stereo doubles your file size for no per-clip benefit, and most models embed mono anyway.
The behavior log schema
Keep it append-only and structured. CSV or JSONL. Each row is a timestamped observation:
time crow_id behavior partner notes
13:42:01.4 AC-04 territorial-caw -- from cottonwood, faces east
13:42:08.2 AC-04 mobbing-attempt AC-04+ pair joined, target redtail
13:42:36.7 AC-07 food-discovery -- new bird, lower branch
13:43:02.0 AC-04 companion-call AC-04 contact w/ mate
13:43:18.9 -- rain-onset -- recording cut shortUse stable crow IDs (banded if possible, descriptor like "left-eye-mark" otherwise). Resist the urge to interpret in the log — "long-caw, head-low, tail-flicked" is data; "agitated" is interpretation that belongs in analysis.
Synchronization in practice
The classic clap-sync works because it puts an audible transient on the audio track and a timestamped event in the log at the same instant. At stage 6 you'll subtract the clap time from the recorder timestamp to compute the offset; everything before the clap is unusable, everything after is aligned to ±50 ms.
For longer sessions, clap every 20 minutes. Recorder clocks drift; logs drift too. Multiple clap-sync points let you correct for both.

# 48 kHz / 24-bit mono with location + recordist metadata
ffmpeg -f avfoundation -i ":1" -ar 48000 -ac 1 -c:a pcm_s24le \
-metadata title="Crow session" \
-metadata location="Discovery Park, Seattle (city-coarsened)" \
-metadata recordist="J. Field" \
-metadata coords="47.66/-122.41 (coarse)" \
capture_$(date +%Y%m%d_%H%M%S).wav