Overview
Every dataset starts the same way: with the actual gathering of raw text, images, audio, or video, before anyone has cleaned it, labelled it, or decided what it means. What happens after collection (cleaning, annotation, quality control, and release) is covered in the chapters that follow.
Data Modalities
What data modality means, why it's the first fork in the road for any collection plan, and how it shapes cost, sourcing, and tooling long before anyone writes an annotation guideline.
Data Sources
A map of where raw data for African-language AI systems actually comes from, and how to weigh one source against another before choosing a collection method.
Web Scraping
What web scraping can and can't deliver for African-language data, the legal and ethical considerations, and the access landscape as it stands.
Application Programming Interfaces (APIs)
How to collect data through official APIs, and why building a project's data plan around any single platform's API is riskier than it looks.
Case Study: Multimodal Data Collection
How ArtELingo-28, a 28-language cross-cultural image-emotion benchmark, was actually collected — and the decisions that don't show up in the Modalities, Sources, or API sections until you try to combine them.
Cost and Resource Planning
Learn how to plan the resources that dataset creation requires, including budgets, timelines, and scaling strategies.
Data Cleaning and Preprocessing
Learn how to prepare raw data for use in language AI systems by improving quality, consistency, and usability.
Data Provenance and Traceability
Learn how to track the origin, history, and transformations of your data to ensure transparency, reproducibility, and accountability.
Ethics, Bias, and Governance
Learn how to ensure responsible dataset creation by addressing bias, protecting privacy, and maintaining transparency throughout the data lifecycle.