Best Practices

Tag Design

Building a tag schema that works in practice — and still works in two years.

The tag schema is the hardest thing to get right and the most expensive to change. A schema that's too narrow leaves authors guessing where their data fits; a schema that's too wide leaves them picking between five near-synonyms and picking different ones each time. Either failure mode undermines everything downstream.

This page collects the patterns we've seen work.

Principles

Small is better than complete. A schema that authors can keep in their head and apply consistently is more valuable than a schema that captures every nuance but requires a reference manual. Start with five tags. Add more only when you have specific downstream controls that need them.

Bounded values beat free-form. Every free-form value is a policy-writing problem: origin: "customer", origin: "Customer", and origin: "client" are three different values to a policy engine. Enumerations force consistency.

Ordered levels where possible. Policies like "at least Confidential" are much easier to write against an ordered scale (public < internal < confidential < restricted) than against unrelated categories. If a tag is conceptually a level, model it as a level.
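As a minimal sketch of the ordering point, sensitivity can be modeled as an ordered enumeration so "at least Confidential" is a single comparison rather than a list of categories. The names here are illustrative, not from any specific policy engine.

```python
from enum import IntEnum

# Illustrative: sensitivity as an ordered level, matching the scale
# public < internal < confidential < restricted.
class Sensitivity(IntEnum):
    PUBLIC = 0
    INTERNAL = 1
    CONFIDENTIAL = 2
    RESTRICTED = 3

def at_least(tag: Sensitivity, floor: Sensitivity) -> bool:
    """'At least Confidential' becomes one comparison, not a category list."""
    return tag >= floor

print(at_least(Sensitivity.RESTRICTED, Sensitivity.CONFIDENTIAL))  # True
print(at_least(Sensitivity.INTERNAL, Sensitivity.CONFIDENTIAL))    # False
```

Had sensitivity been modeled as unrelated categories, the same policy would need to enumerate every acceptable value and be updated whenever a level is added.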

Match the organization's language. If your organization already uses "Internal Use Only" as a classification term, calling it sensitivity: internal will feel natural. Inventing new terminology and imposing it invites drift.

Separate what from why. sensitivity (how sensitive is this) is different from regulatory-scope (which regulation applies) is different from business-domain (who owns this). Conflating them produces tags that are hard to reason about.

Patterns that work

The levels-and-scopes pattern

Most organizations converge on something like:

  • Sensitivity (ordered): public, internal, confidential, restricted
  • Regulatory scope (multi-value): pii, phi, pci, cui, itar, or empty
  • Business domain: finance, engineering, legal, customer, hr, operations
  • Retention class: ephemeral, standard, long-term, indefinite
  • Origin: internal-produced, customer-provided, third-party-provided

These five tags answer most of the questions policies actually ask: how sensitive is this, what regulation applies, who owns it, how long must it live, and who originated it.
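A sketch of how the five-tag schema might be expressed as bounded value sets with a validator that rejects unknown keys and out-of-enumeration values. This is a hypothetical illustration; real policy engines have their own schema definition formats.

```python
# Illustrative encoding of the levels-and-scopes schema. Each tag's values
# are a closed enumeration; 'regulatory-scope' is the one multi-value tag.
SCHEMA = {
    "sensitivity": ["public", "internal", "confidential", "restricted"],  # ordered
    "regulatory-scope": ["pii", "phi", "pci", "cui", "itar"],             # multi-value
    "business-domain": ["finance", "engineering", "legal", "customer", "hr", "operations"],
    "retention-class": ["ephemeral", "standard", "long-term", "indefinite"],
    "origin": ["internal-produced", "customer-provided", "third-party-provided"],
}

def validate(tags: dict) -> list[str]:
    """Return validation errors; an empty list means the tags conform."""
    errors = []
    for key, value in tags.items():
        allowed = SCHEMA.get(key)
        if allowed is None:
            errors.append(f"unknown tag: {key}")
        elif key == "regulatory-scope":  # multi-value: check each element
            errors.extend(f"bad {key} value: {v}" for v in value if v not in allowed)
        elif value not in allowed:
            errors.append(f"bad {key} value: {value}")
    return errors

# Free-form drift is caught at write time: case variants are rejected.
print(validate({"sensitivity": "Confidential"}))
```

Validating at write time is what makes the enumerations enforceable; a schema that is only documented, not checked, drifts back toward free-form values.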

The project-scoped pattern

For organizations that organize by discrete projects (engineering groups, client engagements, research projects):

  • Everything above, plus
  • Project: an enumerated list of active projects

A policy can then express "this data is specific to Project X and only Project X's members can access it." When projects end, the tag remains for historical evidence; new data stops carrying it.
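The project rule above can be sketched as a simple membership check. The project names and membership table here are hypothetical; in practice membership would come from the identity system.

```python
# Illustrative: data tagged with a project is visible only to that project's
# members. A retired project keeps its tag; with no members left, the empty
# set denies everyone, preserving the historical record.
PROJECT_MEMBERS = {
    "project-alpha": {"alice", "bob"},
    "project-beta": {"carol"},
}

def can_access(user: str, tags: dict) -> bool:
    project = tags.get("project")
    if project is None:  # untagged data: defer to other policies
        return True
    return user in PROJECT_MEMBERS.get(project, set())

print(can_access("alice", {"project": "project-alpha"}))  # True
print(can_access("carol", {"project": "project-alpha"}))  # False
```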

The coalition pattern

For organizations that share data with external partners under varying agreements:

  • Everything above, plus
  • Releasability: enumerated destinations (internal-only, partner-X, partner-Y, public)

A policy then reasons about whether a specific recipient is in the releasability set. When a new partner joins, the tag schema extends with a new releasability value; existing data is unaffected.
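A sketch of that reasoning: the recipient's destination must appear in the object's releasability set. Partner names are illustrative; note that adding partner-y to the schema requires no change to existing data.

```python
# Illustrative releasability check. An object with no releasability tag
# defaults to internal-only; 'public' releases to any destination.
def releasable_to(destination: str, tags: dict) -> bool:
    releasability = set(tags.get("releasability", ["internal-only"]))
    if "public" in releasability:
        return True
    return destination in releasability

tags = {"releasability": ["internal-only", "partner-x"]}
print(releasable_to("partner-x", tags))  # True
print(releasable_to("partner-y", tags))  # False: not in this object's set
```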

What not to do

Don't encode secrets in tag values. Tags are visible at administrative review. Don't use tag values that themselves reveal sensitive information.

Don't layer classification with project names that change. A project tag of Project-Apollo that is renamed Project-Artemis mid-stream produces a schema-evolution problem. Use stable identifiers.

Don't conflate obligation with classification. An object that must be watermarked when viewed is a policy obligation — it isn't a classification. A tag like watermark-required: true is the wrong tool; an obligation attached to the viewing policy is the right one.

Don't create tags for things the policy engine can compute. The object's age, size, or format are things the system already knows. Don't encode them as tags; reference them directly in policies.
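To make the "compute, don't tag" point concrete, a policy predicate can read age and size from the metadata the system already tracks. The field names here are assumptions for illustration.

```python
from datetime import datetime, timedelta, timezone

# Illustrative: a retention-style predicate over computed attributes.
# No 'age' or 'size' tag exists; both are derived from object metadata,
# so they can never go stale the way a tagged value would.
def stale_and_large(obj: dict) -> bool:
    age = datetime.now(timezone.utc) - obj["created_at"]
    return age > timedelta(days=365) and obj["size_bytes"] > 10_000_000
```

A tagged age would be wrong the day after it was written; a computed one is always current.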

How to validate a draft schema

Before making the schema tenant-live, run it through a validation cycle:

  1. Sample content. Pull a representative sample of real content — pre-classification documents from the existing organization. At least a few hundred objects, ideally a thousand.

  2. Have authors classify. Ask three to five different authors to apply the schema to the same sample, independently. If they agree on 80%+ of objects, the schema is usable. If they agree on only 50%, the schema has ambiguities that need fixing.

  3. Inspect disagreements. Where authors applied different tags to the same object, understand why. Sometimes the object really does straddle categories (fine — the schema lets the author pick one). Sometimes the category definitions are unclear (fix the schema before launch).

  4. Write ten real policies. Take actual governance statements from your organization's policies — the written ones in Word documents — and express them against the schema. If you can't, the schema is missing something.

  5. Test evolution. Imagine adding a new classification value in six months. Does it fit, or does it require re-classifying existing data? An additive change is fine; a re-classification is a migration.
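The agreement check in step 2 can be computed as the fraction of sampled objects on which every author applied the same tag. This is a minimal sketch; more rigorous studies would use a chance-corrected statistic.

```python
# Illustrative inter-author agreement for a single tag. Input: one
# {object_id: tag_value} dict per author, all covering the same sample.
def agreement_rate(classifications: list[dict]) -> float:
    object_ids = classifications[0].keys()
    agreed = sum(
        1 for oid in object_ids
        if len({author[oid] for author in classifications}) == 1
    )
    return agreed / len(object_ids)

authors = [
    {"doc1": "internal", "doc2": "confidential"},
    {"doc1": "internal", "doc2": "restricted"},
]
print(agreement_rate(authors))  # 0.5: inspect the doc2 disagreement
```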

Invest a few weeks in this validation. The schema you emerge with is the schema you'll run for years.

Maintenance

The schema will evolve. Additive changes are cheap — adding a new business-domain value when a new business unit is formed, for example. Restrictive changes are expensive — removing a value that existing data carries.

Review the schema annually. Retire values that haven't been used in a year. Add values that authors have repeatedly reached for via free-form workarounds.

The point is not to have the perfect schema on day one. It's to have a schema that can evolve without requiring a re-classification of the entire data estate.