How to read this playbook

The playbook runs end-to-end through the dataset lifecycle. Here is the whole book at a glance: every chapter, grouped into the four phases of building a dataset. Click any chapter to jump straight to it.

You don't have to read it in order. Pick the path that fits where you are:

  • New to dataset design. Start here, then read chapters 2–4 in order: Data Collection, Annotation Design, Data Quality. They build on each other and cover the foundations everyone needs.
  • You already have raw data and want help annotating it. Go to chapter 3 (Annotation Design and Workforce Management), then chapter 4 (Data Quality Assurance and Validation).
  • You're working with a specific modality (speech, multimodal, low-resource scripts). Skip to chapter 5 (Modality-Specific Task Design).
  • You're using LLMs to generate or augment data. Read chapter 7 (LLM-Assisted and Synthetic Data Generation) for the trade-offs and safeguards.
  • You're preparing a dataset for release. Read chapter 6 (Documentation, Data Release, and Governance) and chapter 9 (Dataset Lifecycle Management and Release Checklist).
  • You're offline or on a slow connection. Use Download PDF in the navbar. The whole playbook bundles into one file, rebuilt on every release.
  • You'd rather read in another language. Use the language switcher at the top-right. Translations are community-maintained and grow over time.

Throughout, you'll find practical templates (consent forms, annotation guidelines, governance checklists), worked examples from real African-language projects, and links to datasets and tools you can reuse. New terms are defined in the glossary.
