Stage 01

Record — capture the call without breaking the scene.

Field-recording discipline pays compound interest. Every model downstream sees the audio you captured; preprocessing only subtracts.

AI narration · Stage 1 — Capture

Every vocal map starts with a microphone in front of a crow. Most public corpora — Wikimedia, Xeno-canto, the Macaulay Library — were built with handheld parabolic mics aimed at calling birds, recorded at 44 or 48 kilohertz. That's enough bandwidth to cover the full crow vocal range, from the 200-hertz bottom of the loudest caws to the 8-kilohertz top of juvenile begging. The harder problem isn't sample rate; it's behavioral context. A recording without notes — where, when, what the bird was doing, whether it was alone or with kin — is acoustically usable but semantically thin. That's why the wearable-logger studies of the 2020s, Demartsev's carrion crow work especially, are quietly revolutionary.

The minimum viable rig

A directional shotgun microphone, a portable recorder, and a phone running a timestamped behavior log. Anything beyond that is optimization. Anything less and you start losing data you can't recover.

Mic: Sennheiser ME66/MKE600-class shotgun or a Røde NTG-2. Foam + furry windscreen. Off-axis rejection matters more than absolute sensitivity.
Recorder: Zoom H1n or H5 at 48 kHz / 24-bit, mono, lossless WAV. Phone-only is acceptable but a real recorder gives you preamp headroom and accurate clock.
Behavior log: phone app that timestamps events. Keep it boring: who, where, what behavior, who-with.
Sync:at session start, clap once near the mic with the phone visible. You'll align audio and log at stage 6 by that clap.

Sample rate, bit depth, channels

48 kHz / 24-bit, mono. Crow energy lives 200 Hz – 8 kHz; 48 kHz captures everything with comfortable headroom for any downstream filter. 24-bit preserves quiet calls without quantization noise eating the spectral grain. Mono is the right choice — stereo doubles your file size for no per-clip benefit, and most models embed mono anyway.

The behavior log schema

Keep it append-only and structured. CSV or JSONL. Each row is a timestamped observation:

time         crow_id   behavior          partner   notes
13:42:01.4   AC-04     territorial-caw   --        from cottonwood, faces east
13:42:08.2   AC-04     mobbing-attempt   AC-04+    pair joined, target redtail
13:42:36.7   AC-07     food-discovery    --        new bird, lower branch
13:43:02.0   AC-04     companion-call    AC-04     contact w/ mate
13:43:18.9   --        rain-onset        --        recording cut short

Use stable crow IDs (banded if possible, descriptor like "left-eye-mark" otherwise). Resist the urge to interpret in the log — "long-caw, head-low, tail-flicked" is data; "agitated" is interpretation that belongs in analysis.

Synchronization in practice

The classic clap-sync works because it puts an audible transient on the audio track and a timestamped event in the log at the same instant. At stage 6 you'll subtract the clap time from the recorder timestamp to compute the offset; everything before the clap is unusable, everything after is aligned to ±50 ms.

For longer sessions, clap every 20 minutes. Recorder clocks drift; logs drift too. Multiple clap-sync points let you correct for both.

Isometric illustration of a minimal field-recording rig: directional shotgun mic, field recorder, smartphone behavior log. — IG · 07 · FIELD · RIG
The rig in isometric: directional mic, field recorder, behavior-log phone, distance floor.

bashCapture command with metadata

# 48 kHz / 24-bit mono with location + recordist metadata
ffmpeg -f avfoundation -i ":1" -ar 48000 -ac 1 -c:a pcm_s24le \
  -metadata title="Crow session" \
  -metadata location="Discovery Park, Seattle (city-coarsened)" \
  -metadata recordist="J. Field" \
  -metadata coords="47.66/-122.41 (coarse)" \
  capture_$(date +%Y%m%d_%H%M%S).wav