What Voxaboxen is
Voxaboxen is an open-source Python library for bioacoustic vocalization segmentation, released by Earth Species Project (ESP) and used in research from multiple labs, including the Demartsev[1] et al. 2026 carrion crow paper. The library does one job well: given a long audio recording, it identifies the temporal boundaries of individual vocalizations, with output suitable for feeding into downstream models, including classifiers. It handles the practical problems of variable signal-to-noise ratios, overlapping vocalizations, species-specific call shapes, and audio file formats. It is not the most sophisticated piece of bioacoustic infrastructure in absolute terms; it is one of the most usable, which has made it the de facto standard in several recent papers.
Why segmentation matters
A long recording (say, an hour of passive acoustic monitoring audio, or a thirty-minute file from a wearable bioacoustic logger) is mostly silence and ambient sound, with vocalizations of interest scattered through it. Before you can run any audio foundation model on the vocalizations, you have to find them: identify the start and end times of each call, isolate each call from the background, and package the results for downstream processing. This was historically a manual or semi-manual step requiring expert ear-time. Modern automated segmentation makes the pipeline tractable at scale. The Demartsev[1] dataset of more than 127,000 vocalizations would not have been buildable without automated segmentation; manually annotating that volume of audio is infeasible.
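To make the "isolate and package" step concrete, here is a minimal sketch of cutting clips out of a recording once start and end times are known. It is not Voxaboxen's API; the function name `extract_clips`, the `pad_s` parameter, and the example segment times are all hypothetical, and the waveform is assumed to be a mono NumPy array.

```python
import numpy as np

def extract_clips(waveform, sample_rate, segments, pad_s=0.0):
    """Slice a mono waveform into one clip per (start_s, end_s) segment.

    Hypothetical helper for illustration; not part of any library.
    """
    clips = []
    n_samples = len(waveform)
    for start_s, end_s in segments:
        # Convert times to sample indices, apply optional padding,
        # and clamp to the bounds of the recording.
        lo = max(0, int(round((start_s - pad_s) * sample_rate)))
        hi = min(n_samples, int(round((end_s + pad_s) * sample_rate)))
        if hi > lo:
            clips.append(waveform[lo:hi])
    return clips

# Made-up example: a 10-second silent recording and two segment times.
sample_rate = 16000
waveform = np.zeros(10 * sample_rate, dtype=np.float32)
segments = [(1.0, 1.5), (4.2, 5.0)]
clips = extract_clips(waveform, sample_rate, segments)
```

A small amount of padding on each side is often useful in practice so that quiet onsets and decays are not clipped, but the right value depends on the species and the recording setup.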
How it works internally
Voxaboxen uses a combination of energy-based detection (finding regions of elevated acoustic energy in expected frequency bands) and learned segmentation (a small neural model trained to identify vocalization boundaries given energy-detected candidates). The combination handles edge cases — overlapping calls, calls embedded in higher noise floors, calls with quiet onset or decay — better than either approach alone. The output is a list of (start_time, end_time, confidence) tuples per detected vocalization, with optional clip extraction for downstream processing. The whole pipeline runs faster than real-time on a modest CPU.
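The energy-based half of that description can be sketched in a few lines. What follows is an illustrative stand-in, not Voxaboxen's implementation: the function name `energy_candidates`, the frame length, the 10 dB threshold, and the way the score is computed are all assumptions. In particular, the third tuple element here is a crude heuristic based on how far a region rises above the noise floor, a placeholder for the confidence a learned model would supply.

```python
import numpy as np

def energy_candidates(waveform, sample_rate, frame_s=0.02,
                      threshold_db=10.0, min_dur_s=0.05):
    """Return (start_s, end_s, score) tuples for regions of elevated energy.

    Illustrative sketch only; parameters and scoring are assumptions.
    """
    frame = int(round(frame_s * sample_rate))
    n_frames = len(waveform) // frame
    frames = waveform[: n_frames * frame].reshape(n_frames, frame)

    # Per-frame RMS level in dB; the small constant avoids log(0).
    rms = np.sqrt(np.mean(frames ** 2, axis=1) + 1e-12)
    level_db = 20 * np.log10(rms)

    # Use the median frame level as a simple noise-floor estimate.
    noise_floor_db = np.median(level_db)
    active = level_db > noise_floor_db + threshold_db

    # Find runs of consecutive active frames.
    padded = np.concatenate(([False], active, [False]))
    changes = np.flatnonzero(padded[1:] != padded[:-1])
    starts, ends = changes[::2], changes[1::2]

    candidates = []
    for s, e in zip(starts, ends):
        start_s = s * frame / sample_rate
        end_s = e * frame / sample_rate
        if end_s - start_s >= min_dur_s:
            # Heuristic score in [0, 1]: mean margin above the noise floor.
            margin = float(np.mean(level_db[s:e]) - noise_floor_db)
            score = min(1.0, margin / (2 * threshold_db))
            candidates.append((float(start_s), float(end_s), score))
    return candidates

# Synthetic check: faint noise with a 1 kHz tone burst from 1.0 s to 1.5 s.
sample_rate = 16000
rng = np.random.default_rng(0)
signal = 0.001 * rng.standard_normal(3 * sample_rate)
t = np.arange(sample_rate // 2) / sample_rate
signal[sample_rate : sample_rate + sample_rate // 2] += 0.5 * np.sin(2 * np.pi * 1000 * t)
found = energy_candidates(signal, sample_rate)
```

A pure energy threshold like this merges overlapping calls and misses calls buried in a high noise floor, which are the cases the text says the learned stage is there to handle.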
Why Earth Species Project released it open
ESP's open-tooling strategy is consistent across their research outputs. NatureLM-audio[2] weights are on Hugging Face. The BEANS[3] benchmark is publicly available. Voxaboxen is on GitHub with an open license. The strategy reflects ESP's mission framing: AI bioacoustics is more useful when the research community can build on shared infrastructure than when each lab maintains proprietary tooling. The strategy also has a self-interest dimension — researchers using ESP's tools cite ESP's papers, which raises the visibility of the underlying work and feeds funding decisions. The mission-and-self-interest alignment has produced consistently open infrastructure releases, and the field is better for it.
What it doesn't do
Voxaboxen segments vocalizations in time. It doesn't classify what species or call type the vocalizations are; that's the foundation-model layer downstream. It doesn't attribute vocalizations to individual animals; that requires either source-localization hardware (microphone arrays) or wearable loggers. It doesn't handle multi-channel audio well (it's designed for mono recordings). It doesn't replace human ear-time entirely — segmentation outputs still benefit from spot-checks against expert review, especially in noisy recording conditions. The library is one piece of the pipeline, not the whole thing.
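Because the library is designed for mono input, multi-channel recordings usually need to be reduced to a single channel first. The sketch below shows one common way to do that, assuming the audio is a NumPy array shaped (n_samples, n_channels); the helper name `to_mono` is hypothetical, and whether Voxaboxen performs any downmixing of its own is not something this article covers.

```python
import numpy as np

def to_mono(audio):
    """Average the channels of a (n_samples, n_channels) array into 1-D mono.

    Hypothetical preprocessing helper; 1-D input is returned unchanged.
    """
    audio = np.asarray(audio, dtype=np.float32)
    if audio.ndim == 1:
        return audio
    return audio.mean(axis=1)

# Made-up two-sample stereo signal.
stereo = np.array([[1.0, 0.0], [0.5, 0.5]])
mono = to_mono(stereo)
```

Averaging is not always the right choice: if the channels come from widely spaced microphones, signals can partially cancel, and selecting the single channel with the best signal-to-noise ratio may work better.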
How it changed what's tractable
Before tools like Voxaboxen, building a research dataset of, say, ten thousand annotated crow vocalizations required either a paid annotation team or several PhD students' worth of labeling labor. After Voxaboxen, building that same dataset became overnight-batch-process work. The temporal cost of bioacoustic research dropped by roughly an order of magnitude, which means more researchers can attempt more ambitious questions. The 2026 Demartsev[1] paper is one consequence of this shift; the multiple-species comparative studies that 2027-2028 will probably bring are downstream of the same infrastructure change.
Open infrastructure as research strategy
The broader pattern Voxaboxen illustrates: in fields where data acquisition is the bottleneck, open shared infrastructure produces more progress than the same investment in proprietary tooling. Bioacoustics has converged on this pattern faster than many adjacent ML-applied-to-science fields. The labs that release tools (ESP, Cornell, Google Research, and so on) accumulate citation networks and downstream-paper output that more than compensate for any competitive advantage they'd have gained from keeping the tools internal. CrowLingo's project benefits from this shift in a direct way: every claim our atlas makes is downstream of methodological work that's been open-sourced into the broader research community.