2. Data Collection, Curation, and Governance | Waraka Community AfriPlaybook

📄️Overview

Every dataset starts the same way: with the actual gathering of raw text, images, audio, or video, before anyone has cleaned it, labelled it, or decided what it means. What happens after collection (cleaning, annotation, quality control, and release) is covered in the chapters that follow.

📄️Data Modalities

What data modality means, why it's the first fork in the road for any collection plan, and how it shapes cost, sourcing, and tooling long before anyone writes an annotation guideline.

📄️Data Sources

A map of where raw data for African-language AI systems actually comes from, and how to weigh one source against another before choosing a collection method.

📄️Web Scraping

What web scraping can and can't deliver for African-language data, the legal and ethical considerations, and the access landscape as it stands.

📄️Application Programming Interfaces (APIs)

How to collect data through official APIs, and why building a project's data plan around any single platform's API is riskier than it looks.

📄️Case Study: Multimodal Data Collection

How ArtELingo-28, a 28-language cross-cultural image-emotion benchmark, was actually collected — and the decisions that don't show up in the Modalities, Sources, or API sections until you try to combine them.

📄️Cost and Resource Planning

Learn how to effectively plan the resources required for dataset creation, including budgeting, timelines, and scaling strategies.

📄️Data Cleaning and Preprocessing

Learn how to prepare raw data for use in language AI systems by improving quality, consistency, and usability.

📄️Data Provenance and Traceability

Learn how to track the origin, history, and transformations of your data to ensure transparency, reproducibility, and accountability.

📄️Ethics, Bias, and Governance

Learn how to ensure responsible dataset creation by addressing bias, protecting privacy, and maintaining transparency throughout the data lifecycle.