Ethics and FPIC Guide

Every linguistic AI project crosses ethical boundaries — some obvious (endangered-language community data), some subtle (Bible-NLP register norms, CC-BY-NC in a commercial mix). The linguistic-ethics skill enforces the framework that keeps projects on the right side of those boundaries.

Why Ethics Is A-Tier

The linguistic-ethics skill scores A− (106/120), the highest in the suite, because ethics failures cascade in ways technical failures do not. A wrong tokenizer choice costs a retraining run. An FPIC violation can:

Require a full data purge and model withdrawal
Damage relationships with communities that are already marginalized by the tech industry
Trigger legal consequences (EU AI Act, Indigenous data sovereignty laws, GDPR-like cultural data protections)
End access to irreplaceable community archives (ELAR, AILLA, PARADISEC)

The cost asymmetry is extreme. linguistic-ethics runs early (Scope seed) because catching issues early is orders of magnitude cheaper, and again late (Release gate) as a final check before release.

CARE Principles

CARE (Collective benefit, Authority to control, Responsibility, Ethics) was developed specifically for Indigenous and community-held data. It complements FAIR (Findable, Accessible, Interoperable, Reusable) rather than replacing it.

CARE	Question to Ask
Collective benefit	Does training on this data serve the source community — or only the model developer?
Authority to control	Has the community been given genuine authority over how their data is used?
Responsibility	Have downstream harms been considered — generative outputs, cultural distortion, commercial use?
Ethics	Is there ongoing engagement — not just a one-time consent form?

A dataset can satisfy all four FAIR criteria and still fail CARE. CC-BY-4.0 gives legal permission to use data. CARE asks whether the community was genuinely consulted about AI training use — a question that was rarely asked when most multilingual datasets were created.

FPIC in Practice

FPIC (Free, Prior and Informed Consent) is the UN-recognized standard for Indigenous and traditional knowledge use. It is frequently misunderstood as "got a signature → done." Real FPIC requires:

Free: No coercion, no economic pressure, no artificial urgency.

Prior: Consent is given before the data is collected or used. Retroactive consent is not FPIC.

Informed: The community understood what LLM training means. In 2010, when many field recordings were made, "training a language model" was not a concept most communities could evaluate. Informed consent given earlier for a different purpose does not cover AI training.

Consent: Affirmative, specific, and revocable. "Silence = consent" is not consent, and communities can withdraw consent at any time.
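
The four conditions above can be expressed as a record check. This is a minimal sketch, not part of the linguistic-ethics skill itself: `ConsentRecord`, its field names, and `fpic_gaps` are hypothetical names chosen for illustration, and a real consent process involves far more than boolean flags.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class ConsentRecord:
    """Hypothetical consent record; every field name is an illustrative assumption."""
    obtained_without_pressure: bool   # Free
    consent_date: Optional[date]      # Prior: compared with first data use
    first_data_use: date
    purposes_explained: set           # Informed: purposes the community evaluated
    affirmative: bool                 # Consent: an explicit yes, not silence
    withdrawn: bool = False           # Consent is revocable


def fpic_gaps(record: ConsentRecord, intended_purpose: str) -> list:
    """Return the FPIC conditions this record fails for the intended purpose."""
    gaps = []
    if not record.obtained_without_pressure:
        gaps.append("free")
    # Consent must strictly precede first use; retroactive consent does not count.
    if record.consent_date is None or record.consent_date >= record.first_data_use:
        gaps.append("prior")
    # Consent given for a different purpose does not cover this one.
    if intended_purpose not in record.purposes_explained:
        gaps.append("informed")
    if not record.affirmative or record.withdrawn:
        gaps.append("consent")
    return gaps


# A 2010 field recording deposited for preservation only.
archive_2010 = ConsentRecord(
    obtained_without_pressure=True,
    consent_date=date(2010, 5, 1),
    first_data_use=date(2010, 6, 1),
    purposes_explained={"preservation"},
    affirmative=True,
)
print(fpic_gaps(archive_2010, "llm_training"))  # ['informed']
```

The purpose check is the one that most older archives would fail: the consent is valid for what it covered, but AI training was never among the purposes the community evaluated.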

Practical FPIC for Different Vitality Levels

EGIDS	Status	Required Engagement
0–3	International / National / Provincial / Wider Communication	Standard FPIC + license check
4–6a	Educational / Developing / Vigorous	Standard FPIC
6b–7	Threatened / Shifting	Mandatory community pre-engagement BEFORE data acquisition
8a–10	Moribund / Nearly Extinct / Dormant / Extinct	Archive-only; FPIC from descendant community; route to DELAMAN/ELAR

For EGIDS 6b and above, route to a partner organization before first contact with raw data: Te Hiku Media for te reo Māori, First Languages Australia for Australian Indigenous languages, and ELAR for many other endangered languages.
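
The tiering above can be sketched as a lookup over the EGIDS scale. The function names `required_engagement` and `needs_partner_routing` are assumptions made for this example, not an API of the linguistic-ethics skill; the returned strings are taken from the table.

```python
# EGIDS levels in order from most to least vital.
EGIDS_ORDER = ["0", "1", "2", "3", "4", "5", "6a", "6b", "7", "8a", "8b", "9", "10"]


def _rank(level: str) -> int:
    """Position of a level on the scale; raises ValueError for unknown levels."""
    return EGIDS_ORDER.index(level)


def required_engagement(level: str) -> str:
    """Map an EGIDS level to the engagement tier listed in the table."""
    rank = _rank(level)
    if rank <= _rank("3"):
        return "Standard FPIC + license check"
    if rank <= _rank("6a"):
        return "Standard FPIC"
    if rank <= _rank("7"):
        return "Mandatory community pre-engagement BEFORE data acquisition"
    return "Archive-only; FPIC from descendant community; route to DELAMAN/ELAR"


def needs_partner_routing(level: str) -> bool:
    """True for EGIDS 6b and above, where a partner organization comes first."""
    return _rank(level) >= _rank("6b")


print(required_engagement("6b"))
print(needs_partner_routing("6a"))  # False
```

Keeping the levels in an ordered list avoids comparing strings such as "6a" and "10" directly, which would sort incorrectly.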

Sacred-Text and Culturally-Sensitive Material

The linguistic-ethics skill uses a decision framework anchored by five canonical examples. This is not a blocklist — it is a set of reference cases for applying judgment.

The Framework

Source community involvement — Was the community consulted specifically about generative AI use?
Use intent — Is the model for research, commercial deployment, or community use?
Distribution scope — Public release vs. research-only vs. community-gated?
Technical safeguards — Can generations of culturally-sensitive content be filtered or blocked?

Canonical Examples

Quranic text: Training generative models on Quranic text to produce Quranic-style generation violates religious community standards even when the text is in the public domain. The text itself can be used for non-generative tasks (translation benchmarks, corpus statistics) under standard license terms. Generative use requires community sign-off.

Indigenous oral histories: Many oral history recordings in ELAR and similar archives were deposited by community members for preservation purposes — not for AI training. The deposit license may be permissive, but the community intent was preservation. Treat as community-gated unless the community has explicitly authorized AI training use.

Sami yoik recordings: Yoik (joik) is a sacred musical form in Sami culture. Use outside of Sami contexts, especially commercial training, violates Sami Council cultural ownership guidelines. This is true even for recordings that have been publicly broadcast.

Aboriginal Australian songlines: Songlines encode cultural, geographical, and spiritual knowledge. Recording, distributing, and using them in AI systems requires custodian permission under Indigenous Cultural and Intellectual Property (ICIP) protocols. This is not optional.

Bible-NLP: Bible translations are often CC-BY licensed and technically freely usable. However, many translation communities and supporting organizations prefer that their translations not be used for commercial generative AI (which could produce Bible-sounding text that the community would not endorse). Flag in model cards; avoid commercial-generative use without community check.

License Compatibility for Dataset Mixes

When combining multiple datasets, the most restrictive license governs the combined work:

Dataset A: CC-BY-4.0
Dataset B: CC-BY-NC-4.0   ← most restrictive
Dataset C: CC-BY-SA-4.0
Combined model: CC-BY-NC-SA (non-commercial + ShareAlike)
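
The rule illustrated above can be sketched as a union of license conditions. This is an assumption-laden illustration: `combined_conditions` is a hypothetical helper, it only understands CC-BY identifiers with optional NC and SA elements, and it does not check whether the licenses can legally be combined at all, which needs a separate review.

```python
def combined_conditions(licenses: list) -> str:
    """Union the NC and SA conditions across a mix of CC-BY-family licenses."""
    flags = set()
    for lic in licenses:
        parts = lic.upper().split("-")
        # A missing or unrecognized license is treated as restricted, never as open.
        if parts[:2] != ["CC", "BY"]:
            raise ValueError(f"unrecognized or missing license, treat as restricted: {lic!r}")
        for part in parts[2:]:
            if part in {"NC", "SA"}:
                flags.add(part)
            elif part[:1].isdigit():
                continue  # version suffix such as 4.0
            else:
                raise ValueError(f"license element not handled by this sketch: {part!r}")
    # Emit elements in the conventional order: BY, then NC, then SA.
    return "-".join(["CC", "BY"] + [f for f in ("NC", "SA") if f in flags])


mix = ["CC-BY-4.0", "CC-BY-NC-4.0", "CC-BY-SA-4.0"]
print(combined_conditions(mix))  # CC-BY-NC-SA
```

Raising on anything unrecognized is deliberate: a dataset with an empty license field stops the pipeline instead of silently passing as permissive.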

Common traps:

CC-BY-NC sneaks in via aggregator datasets (CulturaX wraps sources, each with its own license)
CC-BY-SA propagates ShareAlike to model outputs — the model's generated text must be CC-BY-SA
"No license" is not "public domain" — treat as restricted; contact source

The Attribution Registry

Every dataset must have a traceable attribution record:

Source URL or citation
License (with version + date)
Date acquired
Permission record (if FPIC required)
Community contact (if applicable)
Lineage (parent dataset(s) it was derived from)

Never strip attribution lineage when merging datasets. Once lineage is gone, you cannot rebuild it, and you cannot honor propagated obligations. This is a hard requirement, not a best-effort goal.
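
One way to keep lineage intact through merges is to make each derived record hold its parent records directly. The sketch below assumes a hypothetical `AttributionRecord` type whose fields mirror the list above; the names `merge_records` and `all_ancestors` are illustrative and not taken from any existing tool.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class AttributionRecord:
    """Hypothetical registry entry; fields follow the attribution list."""
    name: str
    source: str                               # URL or citation
    license: str                              # include version and date
    date_acquired: str
    permission_record: Optional[str] = None   # only if FPIC is required
    community_contact: Optional[str] = None
    lineage: tuple = ()                       # parent AttributionRecord objects


def merge_records(name: str, parents: list, license: str, date_acquired: str) -> AttributionRecord:
    """Create a derived record that keeps every parent record in its lineage."""
    if not parents:
        raise ValueError("a merged dataset must list the records it was derived from")
    return AttributionRecord(
        name=name,
        source="derived",
        license=license,
        date_acquired=date_acquired,
        lineage=tuple(parents),
    )


def all_ancestors(record: AttributionRecord) -> list:
    """Depth-first list of the record's name followed by every ancestor's name."""
    names = [record.name]
    for parent in record.lineage:
        names.extend(all_ancestors(parent))
    return names


a = AttributionRecord("A", "https://example.org/a", "CC-BY-4.0", "2024-01-10")
b = AttributionRecord("B", "https://example.org/b", "CC-BY-NC-4.0", "2024-02-03")
mix = merge_records("mix", [a, b], "CC-BY-NC-4.0", "2024-03-01")
print(all_ancestors(mix))  # ['mix', 'A', 'B']
```

Because the records are frozen and parents are stored whole, a merge cannot drop or overwrite an ancestor's license or permission record.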

Model Card Requirement

Every release includes a model card with:

Datasets + licenses + lineage
Ethics statement (CARE alignment)
Intended uses + restrictions
Limitations and known biases (including register, dialect, and demographic coverage)
Contact for community concerns / opt-out
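
A release pipeline could check these sections mechanically before anything ships. The following is a minimal sketch under stated assumptions: the section keys and the functions `missing_sections` and `release_allowed` are hypothetical, and the check confirms only that each section is present and non-empty, not that its content is adequate.

```python
# Hypothetical keys, one per required model card section.
REQUIRED_SECTIONS = (
    "datasets",            # datasets + licenses + lineage
    "ethics_statement",    # CARE alignment
    "intended_uses",       # intended uses + restrictions
    "limitations",         # limitations and known biases
    "community_contact",   # contact for concerns / opt-out
)


def missing_sections(card: dict) -> list:
    """Return required sections that are absent, empty, or whitespace-only."""
    missing = []
    for key in REQUIRED_SECTIONS:
        value = card.get(key)
        if not value or (isinstance(value, str) and not value.strip()):
            missing.append(key)
    return missing


def release_allowed(card: dict) -> bool:
    """A release is blocked while any required section is missing."""
    return not missing_sections(card)


draft = {
    "datasets": ["A (CC-BY-4.0)", "B (CC-BY-NC-4.0)"],
    "ethics_statement": "Reviewed against CARE.",
    "intended_uses": "Research only.",
    "limitations": "   ",
}
print(missing_sections(draft))  # ['limitations', 'community_contact']
```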

"We'll add the model card later" is not acceptable. Model cards are part of the release decision — without one, release is incomplete.

Getting Help

For projects involving community-controlled data:

Te Hiku Media — te reo Māori and Indigenous language AI
First Languages Australia — Australian Indigenous language data
ELDP / Hans Rausing Endangered Languages Project — endangered language documentation
DELAMAN network — links to regional archives with expertise
