The three axes

Time, frequency, intensity. Time runs along the horizontal axis: left to right is earlier to later. Frequency runs along the vertical axis: bottom to top is low pitch to high pitch, measured in hertz (Hz) or kilohertz (kHz). Intensity is shown as brightness or color: more energy at a given time-frequency point shows as a brighter pixel. That's it. That's the whole representation. Every spectrogram you'll ever look at follows the same three-axis convention; only the color scheme and resolution change. Once the convention clicks, every spectrogram becomes legible.
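The three axes map directly onto the arrays a short-time Fourier transform produces. The sketch below is one minimal way to compute a spectrogram with NumPy; the function name, frame length, hop size, and the synthetic 1000 Hz test tone are illustrative assumptions, not settings taken from any particular tool.

```python
import numpy as np

def spectrogram(signal, sample_rate, frame_len=512, hop=128):
    """Return (times_s, freqs_hz, power_db) for a 1-D signal.

    frame_len and hop are illustrative defaults, not recommendations.
    """
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    # Slice the signal into overlapping, windowed frames.
    frames = np.stack([signal[i * hop : i * hop + frame_len] * window
                       for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2   # (n_frames, n_bins)
    power_db = 10 * np.log10(power + 1e-12)            # intensity, in dB
    times = (np.arange(n_frames) * hop + frame_len / 2) / sample_rate
    freqs = np.fft.rfftfreq(frame_len, d=1 / sample_rate)
    # Transpose so rows are frequency (vertical) and columns are time (horizontal).
    return times, freqs, power_db.T

# Synthetic example: one second of a 1000 Hz tone sampled at 8 kHz.
sr = 8000
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 1000 * t)

times, freqs, power_db = spectrogram(tone, sr)
peak_hz = freqs[power_db.mean(axis=1).argmax()]
print(power_db.shape, peak_hz)
```

Plotting power_db as an image with times on the horizontal axis and freqs on the vertical axis would show a single bright horizontal line at 1000 Hz.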

Why frequency matters

Different sounds have different frequency distributions. A low rumble (truck engine, distant thunder) sits at the bottom of the spectrogram. A high whistle (bird whistle, smoke alarm, sibilant 's' in human speech) sits at the top. Most natural sounds occupy a range. American crow vocalizations live mostly between 200 Hz at the bottom (the lowest territorial caw fundamentals) and 8 kHz at the top (juvenile begging upper harmonics). Most human speech occupies 80 Hz to 4 kHz. The frequency range a sound occupies tells you a lot about what's producing it.
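One rough way to put a number on "the range a sound occupies" is to find the band that holds most of its spectral energy. The sketch below is illustrative only: the function name, the 5% and 95% cutoffs, and the synthetic 60 Hz and 4000 Hz tones are assumptions chosen for the example.

```python
import numpy as np

def occupied_band(signal, sample_rate, lower=0.05, upper=0.95):
    """Frequencies (Hz) where cumulative spectral energy crosses the cutoffs."""
    power = np.abs(np.fft.rfft(signal)) ** 2
    freqs = np.fft.rfftfreq(len(signal), d=1 / sample_rate)
    cumulative = np.cumsum(power) / power.sum()
    lo = freqs[np.searchsorted(cumulative, lower)]
    hi = freqs[np.searchsorted(cumulative, upper)]
    return float(lo), float(hi)

sr = 16000
t = np.arange(sr) / sr
rumble = np.sin(2 * np.pi * 60 * t)       # stand-in for a low rumble
whistle = np.sin(2 * np.pi * 4000 * t)    # stand-in for a high whistle

print(occupied_band(rumble, sr))            # energy concentrated near 60 Hz
print(occupied_band(whistle, sr))           # energy concentrated near 4000 Hz
print(occupied_band(rumble + whistle, sr))  # band spans both components
```

Real recordings would need background noise handled first; this only shows the idea on clean synthetic tones.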

What harmonics look like

Many natural sounds aren't single frequencies: they're a fundamental frequency plus integer multiples of it (harmonics). On a spectrogram, a tonal sound with harmonics looks like horizontal stripes stacked on top of each other: the bottom stripe is the fundamental, the next stripe is twice the fundamental's frequency, the next is three times, and so on. Crow caws are tonal: a caw's spectrogram shows a fundamental around 500-1500 Hz with several harmonic stripes above it. The relative loudness of the different harmonic stripes, which bioacousticians call harmonic emphasis, is the most reliable individual-identity signal in American crow vocalizations (Mates[1] et al. 2014).
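The stripe pattern can be reproduced with a synthetic tone. The sketch below builds a 750 Hz fundamental with three harmonics and reads back the relative level of each stripe. The amplitudes are hypothetical, and the relative-level measure is only a simple stand-in for the idea of harmonic emphasis, not the analysis used in the cited study.

```python
import numpy as np

sr = 16000
t = np.arange(sr) / sr            # 1 second of audio gives 1 Hz per FFT bin
f0 = 750.0                        # fundamental, within the caw range given above
amps = [1.0, 0.8, 0.5, 0.2]       # hypothetical amplitudes for harmonics 1..4

# Sum of sinusoids at f0, 2*f0, 3*f0, 4*f0.
signal = sum(a * np.sin(2 * np.pi * f0 * (k + 1) * t) for k, a in enumerate(amps))

spectrum = np.abs(np.fft.rfft(signal))
freqs = np.fft.rfftfreq(len(signal), d=1 / sr)

def harmonic_levels(spectrum, freqs, f0, n_harmonics):
    """Magnitude at each integer multiple of f0, normalized to the strongest."""
    levels = np.array([spectrum[np.argmin(np.abs(freqs - k * f0))]
                       for k in range(1, n_harmonics + 1)])
    return levels / levels.max()

# The four strongest bins fall at integer multiples of the fundamental.
peak_freqs = np.sort(freqs[np.argsort(spectrum)[-4:]])
relative = harmonic_levels(spectrum, freqs, f0, 4)
print(peak_freqs)
print(relative)
```

On a real caw the fundamental would have to be estimated first, and the levels would vary across the duration of the call.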

What noise looks like

Not all sounds are tonal. Some are broadband noise: energy spread across many frequencies without distinct harmonic stripes. A spectrogram of wind, a crackling fire, or a crow's rattle call shows a fuzzy cloud of energy across the frequency range rather than clean horizontal stripes. The texture of the noise (how broad the energy spread is, how it varies over time) distinguishes different noise sources. The rattle call's mechanical quality shows up as a particular kind of broadband structure that's instantly recognizable once you've seen a few.
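The tonal-versus-noisy distinction can be summarized numerically with spectral flatness, the ratio of the geometric mean to the arithmetic mean of the power spectrum. It is a common general-purpose measure, offered here as an illustration rather than as the method this text relies on; the signals below are synthetic stand-ins.

```python
import numpy as np

def spectral_flatness(signal):
    """Geometric mean / arithmetic mean of the power spectrum.

    Values near 0 indicate tonal energy in a few bins; values closer
    to 1 indicate energy spread across many frequencies.
    """
    windowed = signal * np.hanning(len(signal))
    power = np.abs(np.fft.rfft(windowed)) ** 2 + 1e-12   # avoid log(0)
    return float(np.exp(np.mean(np.log(power))) / np.mean(power))

sr = 16000
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 800 * t)                   # clean horizontal stripe
noise = np.random.default_rng(0).normal(size=sr)     # fuzzy broadband cloud

flat_tone = spectral_flatness(tone)
flat_noise = spectral_flatness(noise)
print(flat_tone, flat_noise)
```

Computing the same value frame by frame would show how the texture changes over time, which is closer to what the eye picks up on a spectrogram.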

How sequences look

A single call shows up as a brief vertical slice with whatever structure the call has. A sequence of calls shows up as multiple vertical slices spaced across the time axis. The spacing between slices — the inter-call interval — is one of the key features that distinguishes call types. Territorial caws are spaced about a second apart, on average. Mobbing sequences pack much tighter, with intervals well under a second. Assembly calls space out more variably. The temporal pattern of a vocalization is often as informative as the within-call spectral structure.
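Inter-call intervals can be estimated from a short-time energy envelope. The sketch below is a simplified detector under stated assumptions: the frame length, the threshold ratio, and the synthetic 350 ms bursts spaced one second apart are all illustrative choices, and real recordings would need noise handling that is omitted here.

```python
import numpy as np

def call_onsets(signal, sample_rate, frame_len=400, threshold_ratio=0.1):
    """Onset times (s) where frame energy rises above a fraction of the maximum."""
    n_frames = len(signal) // frame_len
    frames = signal[: n_frames * frame_len].reshape(n_frames, frame_len)
    energy = (frames ** 2).mean(axis=1)
    active = energy > threshold_ratio * energy.max()
    # An onset is a frame that is active while the previous frame was not.
    previous = np.concatenate(([False], active[:-1]))
    rising = np.flatnonzero(active & ~previous)
    return rising * frame_len / sample_rate

sr = 8000
signal = np.zeros(4 * sr)
burst_len = round(0.35 * sr)                       # one 350 ms synthetic "call"
burst = np.sin(2 * np.pi * 800 * np.arange(burst_len) / sr)
for start_s in (0.5, 1.5, 2.5):                    # calls spaced 1 s apart
    i = round(start_s * sr)
    signal[i : i + burst_len] = burst

onsets = call_onsets(signal, sr)
intervals = np.diff(onsets)                        # inter-call intervals, seconds
print(onsets, intervals)
```

The mean and spread of intervals are the kind of temporal features the paragraph above describes for telling call types apart.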

Reading a real crow caw

Look at the territorial-caw exemplar from the CrowLingo atlas's begging or territorial cluster. You'll see: a vertical band of energy roughly 300-400 milliseconds wide (one caw); a fundamental frequency around 700-1000 Hz; several harmonic stripes above it, with the second and third harmonics typically brighter than the fourth and higher; a quick onset, sustained body, and decay back to silence; possibly some spectral noise mixed with the tonal structure at higher frequencies. That's a caw, decoded. Once you can identify a caw, mobbing sequences are caws packed close together with rougher spectral texture, assembly calls are caws with longer inter-call intervals, and so on. The vocabulary builds from one base case.

Why AI uses spectrograms

Audio foundation models such as BirdNET[2] typically don't operate on raw audio waveforms. They operate on spectrograms. The model takes a waveform, computes a mel-scaled spectrogram (a spectrogram whose frequency bins are spaced roughly logarithmically to approximate human pitch perception), and processes the spectrogram as if it were an image. The architectural choices borrow heavily from image classification: convolutional layers, attention mechanisms, position encodings. The reason this works: spectrograms preserve the time-frequency information humans rely on to distinguish sounds, in a 2D format that existing image-recognition machinery already knows how to operate on. The convergence has been productive: bioacoustics inherited two decades of computer-vision research the moment the field switched to spectrogram representation.
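The mel-scaling step can be sketched as a bank of triangular filters applied to an ordinary power spectrogram. The code below uses one common Hz-to-mel formula (2595 * log10(1 + f / 700)); the filter count, FFT size, and sample rate are illustrative assumptions, and nothing here is claimed to match the preprocessing of BirdNET or any other specific model.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + np.asarray(f) / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (np.asarray(m) / 2595.0) - 1.0)

def mel_filterbank(n_mels, n_fft, sample_rate, f_min=0.0, f_max=None):
    """Triangular filters whose edges are equally spaced on the mel scale."""
    f_max = f_max or sample_rate / 2
    fft_freqs = np.fft.rfftfreq(n_fft, d=1 / sample_rate)
    # n_mels filters need n_mels + 2 edge points (low, center, high per filter).
    edges_hz = mel_to_hz(np.linspace(hz_to_mel(f_min), hz_to_mel(f_max), n_mels + 2))
    bank = np.zeros((n_mels, len(fft_freqs)))
    for m in range(n_mels):
        lo, center, hi = edges_hz[m : m + 3]
        rising = (fft_freqs - lo) / (center - lo)
        falling = (hi - fft_freqs) / (hi - center)
        bank[m] = np.clip(np.minimum(rising, falling), 0.0, None)
    return bank, edges_hz

bank, edges_hz = mel_filterbank(n_mels=40, n_fft=1024, sample_rate=16000)

# Placeholder power spectrogram with shape (n_bins, n_frames); random values
# stand in for real audio here.
power_spec = np.random.default_rng(0).random((513, 10))
mel_spec = bank @ power_spec      # (n_mels, n_frames): the 2D "image" a model sees
print(bank.shape, mel_spec.shape)
```

Because the edges are equally spaced in mel but not in hertz, the filters get wider as frequency rises, which is what compresses the upper part of the spectrogram.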