linguistic-ethics

Overview

linguistic-ethics enforces the social and legal obligations that surround language data — obligations that a technically valid license does not automatically satisfy. It applies the CARE principles (Collective benefit, Authority to control, Responsibility, Ethics) alongside FAIR, and manages Free Prior Informed Consent (FPIC) for Indigenous and endangered-language data.

A good engineer can build a tokenizer, mine bitext, and fine-tune a model. None of that protects against training on a dataset whose community didn't consent to model use, releasing a model that generates sacred Indigenous content without permission, or stripping attribution lineage during a dataset merge. These are the high-cost mistakes — they damage communities, harm professional reputation, and carry increasing regulatory consequences (EU AI Act, Indigenous data sovereignty laws).

This skill is routed by the orchestrator twice: early in Scope as an awareness seed, and again at Release as the final gate. It is also invoked per-dataset during Acquire — every dataset that enters the mix crosses an ethics boundary, even open-licensed ones.

Pipeline Position

This skill operates in Phase 1 — Acquire (early seed at Scope, and final gate at Release).

Preceding skills: linguistic-scope (provides the vitality/EGIDS status that sets the required ethics depth)
Following skills: linguistic-corpus, linguistic-bitext (only after per-dataset clearance); release decision after the final gate

When It Activates

Any new dataset being considered for training or eval — before download
Endangered or Indigenous language data of any kind
Religious or sacred text use (Bible-NLP, Quranic, Vedic, Indigenous oral histories)
License audit before release (open / community-gated / restricted decision)
Attribution and provenance tracking design
Drafting a model card's Ethics, Limitations, and Intended Use sections
Routing decisions involving community-controlled archives (DELAMAN, ELAR, AILLA, PARADISEC)

When NOT to use: the dataset is your own English-only synthetic data with no community attribution issues, and the operation is a pure technical refactor with no data implications. Even then, ask once — under-using ethics is the modal failure mode.

What It Does

CARE vs FAIR — a dataset can be fully FAIR (standardized, downloadable, openly licensed) and still violate CARE:

| CARE Principle | Meaning |
| --- | --- |
| Collective benefit | Does this serve the source community? |
| Authority to control | Community decides terms |
| Responsibility | For harms downstream |
| Ethics | Through engagement, not just consent |

FPIC requires all four components: Free (without coercion), Prior (before data is used), Informed (the community understood what models trained on this data could do), Consent (affirmative; can be withdrawn). FPIC is a process, not a document.
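The four components and the withdrawal property can be sketched as a small record. This is a minimal illustration only: the class name `FPICRecord` and its field names are hypothetical, not the skill's actual schema, and a passing check does not stand in for the engagement process itself.

```python
from dataclasses import dataclass


@dataclass
class FPICRecord:
    """Hypothetical record of an FPIC process for one dataset."""
    free: bool = False       # given without coercion
    prior: bool = False      # obtained before the data was used
    informed: bool = False   # community understood what trained models could do
    consent: bool = False    # affirmative consent was given
    withdrawn: bool = False  # set to True if the community withdraws consent

    def missing(self) -> list[str]:
        """Names of the components that are not yet satisfied."""
        names = ("free", "prior", "informed", "consent")
        return [n for n in names if not getattr(self, n)]

    def is_satisfied(self) -> bool:
        """All four components hold and consent has not been withdrawn."""
        return not self.missing() and not self.withdrawn


record = FPICRecord(free=True, prior=True, informed=False, consent=True)
print(record.is_satisfied())  # False
print(record.missing())       # ['informed']
```

Keeping `withdrawn` as a separate flag, instead of resetting `consent`, preserves the fact that consent was once given and later revoked.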

License compatibility for dataset mixes:

Any CC-BY-NC in the mix → entire model is non-commercial-use only
Any CC-BY-SA in the mix → model output must propagate ShareAlike
Any ND (NoDerivatives) source → derivative use is blocked; if it is already in the mix, its terms have already been violated
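As a rough sketch, the three rules can be applied mechanically to a list of license identifiers. The function name `mix_constraints`, the result keys, and the identifier format (for example `CC-BY-NC-4.0` or `CC-BY 4.0`) are assumptions made for illustration. The check only covers Creative Commons tags and is not legal advice.

```python
def mix_constraints(licenses: list[str]) -> dict[str, bool]:
    """Apply the three mix rules to a list of CC license identifiers."""
    # Normalize "CC-BY-NC 4.0" and "cc-by-nc-4.0" to the same tag set,
    # e.g. {"CC", "BY", "NC", "4.0"}.
    tags = [set(lic.upper().replace(" ", "-").split("-")) for lic in licenses]
    return {
        # Any NC source makes the whole model non-commercial-use only.
        "non_commercial_only": any("NC" in t for t in tags),
        # Any SA source means ShareAlike must propagate.
        "share_alike_required": any("SA" in t for t in tags),
        # Any ND source means mixing already violated its terms.
        "terms_violated_by_mixing": any("ND" in t for t in tags),
    }


print(mix_constraints(["CC-BY-4.0", "CC-BY-NC-4.0"]))
# {'non_commercial_only': True, 'share_alike_required': False, 'terms_violated_by_mixing': False}
```

The constraints are computed over the whole mix, not per dataset, because a single restrictive source changes the terms for everything trained on it.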

Sacred-text decision framework (not a hardcoded blocklist):

| Example | What's restricted | Why |
| --- | --- | --- |
| Quranic text | Generation/transformation | Religious community standards |
| Indigenous oral histories | Public release; transformation | Custodian permission required |
| Sami yoik recordings | Non-Sami contexts; commercial | Cultural ownership; Sami Council |
| Aboriginal Australian songlines | Recording, distribution, model use | ICIP protocols |
| Bible-NLP / liturgical text | Commercial training; canonical distortion | Community use norms |
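One way to use the examples above without turning them into a blocklist is to return the planned uses that need community or custodian review, and leave the decision to people. The sketch below assumes hypothetical category keys and use labels transcribed from the table. It also assumes the function is only called for material already identified as potentially sacred.

```python
# Restricted uses per example, transcribed from the table above.
# The category keys and use labels are hypothetical.
RESTRICTED_USES = {
    "quranic_text": {"generation", "transformation"},
    "indigenous_oral_history": {"public_release", "transformation"},
    "sami_yoik": {"non_sami_context", "commercial"},
    "aboriginal_songline": {"recording", "distribution", "model_use"},
    "bible_liturgical": {"commercial_training", "canonical_distortion"},
}


def review_flags(category: str, planned_uses: set[str]) -> set[str]:
    """Return the planned uses that need community or custodian review.

    An unknown category returns every planned use, so that unlisted
    material gets a human decision instead of a silent pass.
    """
    restricted = RESTRICTED_USES.get(category)
    if restricted is None:
        return set(planned_uses)
    return planned_uses & restricted


print(sorted(review_flags("sami_yoik", {"commercial", "evaluation"})))
# ['commercial']
```

The function flags uses for review and never approves or rejects on its own, in line with treating this as a decision framework.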

Release modes:

| Mode | Requirements |
| --- | --- |
| Open | All-open licenses + attribution complete + no community restrictions + standard model card |
| Community-gated | Community sign-off; access criteria + revocation path; model card cites partner |
| Restricted | Use-policy + access controls; legal review |
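The table can be read as a fall-through decision: pick the least restrictive mode whose requirements are met. The sketch below is one possible reading, not the skill's actual logic. `ReleaseInputs`, its fields, and the ordering of the checks are assumptions, and the model-card and legal-review requirements are left out for brevity.

```python
from dataclasses import dataclass


@dataclass
class ReleaseInputs:
    """Hypothetical inputs to the release-mode decision."""
    all_open_licenses: bool
    attribution_complete: bool
    community_restrictions: bool
    community_signoff: bool = False
    revocation_path: bool = False


def recommend_release_mode(x: ReleaseInputs) -> str:
    """Pick the least restrictive mode whose requirements are met."""
    if x.all_open_licenses and x.attribution_complete and not x.community_restrictions:
        return "open"
    if x.community_signoff and x.revocation_path:
        return "community-gated"
    # Fallback: use-policy, access controls, and legal review still apply.
    return "restricted"


print(recommend_release_mode(ReleaseInputs(True, True, False)))  # open
```

Defaulting to `restricted` means a missing sign-off or incomplete attribution never results in an open release by accident.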

Example Usage

Dataset: Bible-NLP Yoruba (CC-BY 4.0)

## Ethics Assessment: Bible-NLP Yoruba Corpus

**Source(s):** Bible-NLP project
**License(s):** CC-BY 4.0
**License compatibility:** OK for open; flag in commercial mix
**CARE check:** NEEDS-WORK — liturgical register >60%;
    community norms prefer non-commercial generative use
**FPIC required?** NO (CC-BY + EGIDS 2 Provincial)
**Sacred-text concerns:** Bible-NLP — flag in model card; limits commercial generation
**Attribution registry status:** COMPLETE
**Recommended release mode:** OPEN (with model card noting register + use norms)
**Outstanding actions:** Limit Bible % in mix to ≤30%; add register-drift warning to model card
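The mix cap in the outstanding actions can be checked with simple arithmetic. The sketch below assumes the share is measured in tokens, which the assessment does not specify. The function name `over_cap`, the source names, and the counts are made up for illustration.

```python
def over_cap(token_counts: dict[str, int], caps: dict[str, float]) -> dict[str, float]:
    """Return the share of each capped source that exceeds its cap."""
    total = sum(token_counts.values())
    if total == 0:
        return {}
    shares = {name: count / total for name, count in token_counts.items()}
    # Sources missing from the mix have a share of 0 and never exceed a cap.
    return {name: shares[name] for name, cap in caps.items()
            if shares.get(name, 0.0) > cap}


# Illustrative counts only: the Bible corpus is 400k of 1M tokens (40%).
mix = {"bible_nlp_yoruba": 400_000, "news": 350_000, "web": 250_000}
print(over_cap(mix, {"bible_nlp_yoruba": 0.30}))  # {'bible_nlp_yoruba': 0.4}
```

An empty result means every capped source is within its limit.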

Related Skills

linguistic-scope: provides vitality status for ethics depth
linguistic-corpus: routes every dataset through ethics before adding to mix
linguistic-bitext: per-source ethics check before adding to parallel mix
linguistic-eval: routes to ethics for release-mode decision after eval
