CrowLingo

Open dataset.

Not yet shipped. Honest about it. The plan and the timeline are below — and they're a forecast, not a promise.

What we plan to ship

A curated open corpus of crow vocalizations — initial target 10,000+ labeled calls — released under CC-BY-NC on Hugging Face Datasets. Per-entry metadata: timestamp, city-coarsened location, behavioral context (where confirmed), individual ID (where confirmed), recording rig, encoder used for the reference embedding, and license. Each entry ships with the raw WAV at 48 kHz / 24-bit.
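
As a rough illustration of what one entry might look like to a consumer, here is a minimal Python sketch. Nothing here is the shipped schema: every field name, the `CallEntry` class, and both helper functions are assumptions made for the example. It models the metadata listed above, with the "where confirmed" fields left optional, and checks that a WAV file matches the stated 48 kHz / 24-bit format using only the standard library.

```python
import io
import wave
from dataclasses import dataclass
from typing import Optional


# Hypothetical record layout; the released schema may use different names.
@dataclass(frozen=True)
class CallEntry:
    timestamp: str                 # e.g. an ISO 8601 string
    city: str                      # city-coarsened location, no coordinates
    recording_rig: str
    encoder: str                   # encoder used for the reference embedding
    license: str
    behavioral_context: Optional[str] = None  # None until confirmed
    individual_id: Optional[str] = None       # None until confirmed


def matches_release_format(wav_file) -> bool:
    """Return True if the WAV is 48 kHz with 24-bit (3-byte) samples."""
    with wave.open(wav_file, "rb") as wav:
        return wav.getframerate() == 48_000 and wav.getsampwidth() == 3


def make_silent_wav(rate: int, sample_width: int, frames: int = 480) -> io.BytesIO:
    """Build an in-memory mono WAV of silence, for demonstration only."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(sample_width)
        wav.setframerate(rate)
        wav.writeframes(b"\x00" * (sample_width * frames))
    buffer.seek(0)
    return buffer
```

Calling `matches_release_format(make_silent_wav(48_000, 3))` exercises the check without any real audio. Keeping the two "where confirmed" fields optional means an unconfirmed label is stored as missing rather than guessed.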

Why CC-BY-NC, not CC-BY

The non-commercial restriction is a hedge against the same failure mode the ethics page names: commercial "crow translator" products, built on our training data, that we wouldn't want to be associated with. CC-BY-NC keeps the data open for research, classrooms, and citizen science while requiring case-by-case licensing for revenue-bearing use.

Where the data will come from

  • Curated subsets of existing CC-licensed material from Macaulay Library and xeno-canto.
  • Field recordings contributed by collaborating researchers (consented, license-clean, behavior-logged).
  • Citizen-scientist submissions through the v3 upload pipeline (see Contribute).

Why this isn't available in v0

Three reasons, all real. First, the labeling discipline the dataset deserves is slow — we'd rather ship 10k high-quality calls in v2 than 50k noisy calls in v1. Second, the license vetting on third-party material is painstaking, and we will not publish anything whose redistribution rights we haven't confirmed. Third, the ethics floor on alarm/distress audio means parts of the corpus can't ship in trivially redistributable form at all — which constrains the format and adds release engineering.
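
To make the second and third reasons concrete, here is a minimal sketch of the kind of release gate they imply. It is an assumption-laden illustration, not our pipeline: the `Tier` names, the `Candidate` fields, and the `SENSITIVE_CONTEXTS` labels are all hypothetical. The idea is only that unvetted material is held back entirely, and alarm/distress audio is routed away from the plainly redistributable tier.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Tier(Enum):
    OPEN = "open"              # ships with the main corpus
    RESTRICTED = "restricted"  # needs a non-trivial release format
    HELD = "held"              # not published


# Hypothetical context labels; the real taxonomy is not specified here.
SENSITIVE_CONTEXTS = {"alarm", "distress"}


@dataclass(frozen=True)
class Candidate:
    source: str
    rights_confirmed: bool                    # redistribution rights vetted
    behavioral_context: Optional[str] = None  # None if unknown


def release_tier(candidate: Candidate) -> Tier:
    """Assign a release tier; unconfirmed rights always win."""
    if not candidate.rights_confirmed:
        return Tier.HELD
    if candidate.behavioral_context in SENSITIVE_CONTEXTS:
        return Tier.RESTRICTED
    return Tier.OPEN
```

The rights check comes first on purpose: an alarm call with unconfirmed rights is held, not restricted, because nothing ships until redistribution is confirmed.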

Timeline (honest forecast)

  • v0 (now): no data, this page.
  • v1: a small (~1k-call) preview corpus from CC-licensed sources, no individual IDs, behavior tags from secondary sources only.
  • v2: the 10k corpus. CC-BY-NC. Hugging Face.
  • v3+: citizen-scientist contributions integrated, individual-ID tracking, multi-year longitudinal subsets.

How to be notified

No newsletter modal, no email harvest. When v1 ships, the announcement appears on the home page and on this page. Subscribe to the GitHub repo's releases for an automated ping.