Overview
linguistic-ethics enforces the social and legal obligations that surround language data — obligations that a technically valid license does not automatically satisfy. It applies the CARE principles (Collective benefit, Authority to control, Responsibility, Ethics) alongside FAIR, and manages Free Prior Informed Consent (FPIC) for Indigenous and endangered-language data.
A good engineer can build a tokenizer, mine bitext, and fine-tune a model. None of that protects against training on a dataset whose community didn't consent to model use, releasing a model that generates sacred Indigenous content without permission, or stripping attribution lineage during a dataset merge. These are the high-cost mistakes — they damage communities, harm professional reputation, and carry increasing regulatory consequences (EU AI Act, Indigenous data sovereignty laws).
This skill is routed by the orchestrator twice: early in Scope as an awareness seed, and again at Release as the final gate. It is also invoked per-dataset during Acquire — every dataset that enters the mix crosses an ethics boundary, even open-licensed ones.
Pipeline Position
This skill operates in Phase 1 (Acquire), with an early awareness seed at Scope and a final gate at Release.
Preceding skills: linguistic-scope (provides the vitality/EGIDS status that sets the required ethics depth)
Following skills: linguistic-corpus, linguistic-bitext (only after per-dataset clearance); release decision after final gate
When It Activates
- Any new dataset being considered for training or eval — before download
- Endangered or Indigenous language data of any kind
- Religious or sacred text use (Bible-NLP, Quranic, Vedic, Indigenous oral histories)
- License audit before release (open / community-gated / restricted decision)
- Attribution and provenance tracking design
- Drafting a model card's Ethics, Limitations, and Intended Use sections
- Routing decisions involving community-controlled archives (DELAMAN, ELAR, AILLA, PARADISEC)
When NOT to use: the dataset is your own English-only synthetic data with no community attribution issues, and the operation is a pure technical refactor with no data implications. Even then, ask once: under-using ethics is the most common failure mode.
What It Does
CARE vs FAIR — a dataset can be fully FAIR (standardized, downloadable, openly licensed) and still violate CARE:
| CARE Principle | Meaning |
|---|---|
| Collective benefit | Does this serve the source community? |
| Authority to control | Community decides terms |
| Responsibility | For harms downstream |
| Ethics | Through engagement, not just consent |
FPIC requires all four components: Free (without coercion), Prior (before data is used), Informed (the community understood what models trained on this data could do), Consent (affirmative; can be withdrawn). FPIC is a process, not a document.
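The four components can be tracked as a simple per-dataset record. The following is a minimal sketch, not part of the skill itself: `FpicRecord` and `fpic_missing` are hypothetical names, and a real FPIC process involves ongoing engagement that no boolean flag captures.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class FpicRecord:
    """Hypothetical record of the FPIC state for one dataset."""
    free: bool                # consent was given without coercion
    prior: bool               # consent was obtained before the data was used
    informed: bool            # community understood what trained models could do
    consent: bool             # consent was affirmative
    withdrawn: bool = False   # consent can be withdrawn at any time


def fpic_missing(record: FpicRecord) -> List[str]:
    """Return the FPIC components that are not currently satisfied."""
    missing = [name for name in ("free", "prior", "informed", "consent")
               if not getattr(record, name)]
    # Withdrawn consent counts as missing consent, even if it was once given.
    if record.withdrawn and "consent" not in missing:
        missing.append("consent")
    return missing
```

An empty list means all four components hold at the moment of the check; because consent can be withdrawn, the check would need to be repeated at the Release gate rather than cached from Acquire.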
License compatibility for dataset mixes:
- Any CC-BY-NC in the mix → entire model is non-commercial-use only
- Any CC-BY-SA in the mix → model output must propagate ShareAlike
- Any ND (NoDerivatives) source blocks derivative use; if one is already in the mix, its terms have already been violated
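The three mix rules above can be sketched as a small check over license identifiers. This is an illustrative sketch only: `mix_constraints` and the result keys are hypothetical names, the parsing assumes Creative Commons style identifiers such as "CC-BY-NC 4.0", and it encodes this document's conservative rules rather than legal advice.

```python
from typing import Dict, List


def mix_constraints(licenses: List[str]) -> Dict[str, bool]:
    """Apply the NC / SA / ND mix rules to a list of license identifiers."""
    non_commercial = False
    share_alike = False
    no_derivatives = False
    for lic in licenses:
        # Normalize "CC-BY-NC 4.0" and "CC-BY-NC-4.0" to the same token list.
        parts = lic.upper().replace(" ", "-").split("-")
        non_commercial = non_commercial or "NC" in parts
        share_alike = share_alike or "SA" in parts
        no_derivatives = no_derivatives or "ND" in parts
    return {
        "non_commercial_only": non_commercial,      # any NC source taints the whole mix
        "share_alike_propagates": share_alike,      # any SA source propagates ShareAlike
        "blocked_by_nd": no_derivatives,            # any ND source blocks derivative use
    }
```

A single flagged source is enough to constrain the whole mix, which is why the check runs over every dataset rather than only the largest ones.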
Sacred-text decision framework (not a hardcoded blocklist):
| Example | What's restricted | Why |
|---|---|---|
| Quranic text | Generation/transformation | Religious community standards |
| Indigenous oral histories | Public release; transformation | Custodian permission required |
| Sami yoik recordings | Non-Sami contexts; commercial | Cultural ownership; Sami Council |
| Aboriginal Australian songlines | Recording, distribution, model use | ICIP protocols |
| Bible-NLP / liturgical text | Commercial training; canonical distortion | Community use norms |
Release modes:
| Mode | Requirements |
|---|---|
| Open | All-open licenses + attribution complete + no community restrictions + standard model card |
| Community-gated | Community sign-off; access criteria + revocation path; model card cites partner |
| Restricted | Use-policy + access controls; legal review |
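The requirements in the table lend themselves to a checklist. Below is a minimal sketch that reports which requirements of a chosen release mode are still unmet; the snake_case labels and the `unmet_requirements` helper are hypothetical, transcribed from the table for illustration, and choosing the mode itself remains a human decision.

```python
from typing import Dict, List, Set

# Requirements per release mode, transcribed from the table above.
# The snake_case labels are hypothetical names used only in this sketch.
REQUIREMENTS: Dict[str, List[str]] = {
    "open": [
        "all_open_licenses",
        "attribution_complete",
        "no_community_restrictions",
        "standard_model_card",
    ],
    "community_gated": [
        "community_sign_off",
        "access_criteria",
        "revocation_path",
        "model_card_cites_partner",
    ],
    "restricted": [
        "use_policy",
        "access_controls",
        "legal_review",
    ],
}


def unmet_requirements(mode: str, satisfied: Set[str]) -> List[str]:
    """List the requirements of `mode` that are not in `satisfied`, in table order."""
    return [req for req in REQUIREMENTS[mode] if req not in satisfied]
```

For example, `unmet_requirements("restricted", {"use_policy", "legal_review"})` returns `["access_controls"]`, signalling that the release cannot proceed in Restricted mode until access controls are in place.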
Example Usage
Dataset: Bible-NLP Yoruba (CC-BY 4.0)
## Ethics Assessment: Bible-NLP Yoruba Corpus
**Source(s):** Bible-NLP project
**License(s):** CC-BY 4.0
**License compatibility:** OK for open; flag in commercial mix
**CARE check:** NEEDS-WORK — liturgical register >60%;
community norms prefer non-commercial generative use
**FPIC required?** NO (CC-BY + EGIDS 2 Provincial)
**Sacred-text concerns:** Bible-NLP — flag in model card; limits commercial generation
**Attribution registry status:** COMPLETE
**Recommended release mode:** OPEN (with model card noting register + use norms)
**Outstanding actions:** Limit Bible % in mix to ≤30%; add register-drift warning to model card
Related Skills
- linguistic-scope — provides vitality status for ethics depth
- linguistic-corpus — routes every dataset through ethics before adding to mix
- linguistic-bitext — per-source ethics check before adding to parallel mix
- linguistic-eval — routes to ethics for release-mode decision after eval