Special Briefing · Methodology / trust read (one-off; event-triggered thereafter) · June 16, 2026

Allegation, Indictment, Ruling — How the Benchmark Scores Accusations vs Proof

In a single fortnight, OpenAI was hit by a 42-state attorney-general subpoena and its score did not move; Oracle's documented severance terms moved it into the Critical band. That is not inconsistency — it is the discipline that keeps the benchmark citable. This briefing examines six entities to show the exact line the record draws between what is alleged and what is proven, and between conduct an institution chose and conduct a government forced on it.

Scope: A curated cross-index cohort of six entities — OpenAI (27.5), Oracle (14.7), UnitedHealth Group (10.2), Anthropic (59.1), Microsoft (65.3), and Hungary (50.2) — chosen because each is a clean test of one rule: how the benchmark separates an *allegation* from an *adjudicated finding* from *documented conduct* from *compelled action*.

Cohort: 6 entities across 3 indexes, each a clean test of one evidentiary rule. · OpenAI 27.5 (Developing) — 42-state AG subpoena, held sub-threshold as allegation-not-adjudicated. · Oracle 14.7 (Critical) — documented coercive-severance terms, moved Developing → Critical. · UnitedHealth 10.2 (Critical) — coordinated multi-AG investigation + DOJ probe, held pre-adjudication. · Anthropic 59.1 (Functional) — government-compelled model shutdown, scored on conduct, held. · Microsoft 65.3 (Established) — compelled human-rights remedy, held sub-threshold. · Hungary 50.2 (Functional) — documented self-initiated recovery (28.1 → 50.2), reaches Functional, not the top.

If you remember one thing

Forty-two attorneys general did not move OpenAI's score; one documented severance clause moved Oracle into Critical. The difference is not the number of accusers — it is the evidentiary stage. An AG subpoena is an allegation under investigation; Oracle's "sign a release or forfeit severance" terms are documented, in-effect conduct. The benchmark prices the second and discounts the first.

Key Findings

An allegation is weighed, not ignored — it just cannot move the composite by itself. The OpenAI probe was reconstructed as a −1.6 pressure on Accountability and Awareness; it fell below the 5-point threshold and stayed inside the band. The accusation registers as sub-dimension pressure and as a pre-registered watch trigger, but it does not become a score event until it is adjudicated.
Density of investigations is not adjudicated harm. UnitedHealth faces a coordinated multi-state AG investigation, a DOJ criminal probe, and a shareholder suit at once — and the record holds it at 10.2, explicitly ruling that breadth of investigation raises enforcement *density* but does not cross into proof. The same rule applies to a near-floor insurer and a trillion-dollar AI lab.
Compelled action is scored on conduct, not on the compulsion. When the US government forced Anthropic to suspend two flagship models, the benchmark scored Anthropic's *behavior in the event* — prompt disclosure, an apology, a stated restoration effort — as mildly positive, not as self-inflicted harm. The same doctrine, run in the other direction, holds Microsoft's *compelled* human-rights remedy below a self-initiated one.
The discipline is symmetric — it disciplines good news too. Hungary's documented, self-initiated recovery moved it from a 28.1 baseline to 50.2, a genuine multi-cycle climb. It lands at Functional, not the top of the scale: trajectory and reform-in-progress are credited as conduct, but a top score is reserved for sustained, self-initiated practice — never conferred on momentum or on a forced settlement.
The endpoints of these discounts are where the live risk sits. Each held score carries a pre-registered conversion trigger — an enforcement action, a settlement, an adjudicated finding, a verified operative effect. The score does not move on the accusation; it moves the day the accusation becomes proof.

The field

1,156 entities across the five bands — the full distribution this briefing draws from.

Source: Compassion Benchmark · CC-BY

1. Frame

The Compassion Benchmark is an evidence institution before it is a ranking. Its credibility rests less on any individual number than on a single, boring-sounding discipline: the benchmark distinguishes what is alleged from what is proven, and it scores documented conduct rather than reputation, sentiment, or the volume of accusation. Get that line wrong in either direction and the institution fails — score on accusations and it becomes a rumor mill that any motivated complainant can move; ignore accusations entirely and it becomes a whitewash that misses real harm until it is too late.

This briefing takes that line as its subject. It assembles six entities — across the AI-labs, Fortune-500, and countries indexes — each of which isolates one move in the evidentiary logic:

OpenAI (27.5) — a 42-state attorney-general subpoena that did not move the score. The pure allegation case.
Oracle (14.7) — documented severance terms that did move the score, into Critical. The documented-conduct case.
UnitedHealth (10.2) — a coordinated multi-AG investigation plus a DOJ probe, held pre-adjudication. The enforcement-density case.
Anthropic (59.1) — a government-compelled model shutdown scored on conduct. The compelled-restriction case.
Microsoft (65.3) — a compelled human-rights remedy held below self-correction. The compelled-remedy case.
Hungary (50.2) — a documented self-initiated recovery that reaches Functional, not the top. The symmetric-discipline case.

The central thesis is one sentence: allegations are weighed but discounted until adjudicated; documented conduct moves scores; and compelled action is scored on conduct, not on the compulsion — and the same three rules govern a trillion-dollar AI lab, a near-floor insurer, and a recovering democracy alike. That uniformity is the point. None of these scores is re-examined here; this is an interpretation of why the record reads the way it does.

2. The cohort — six clean tests of one line

Recomputed directly from rankings[] in each index. Every published composite below reconciles exactly with the canonical JSON — no drift.

| Entity | Index | Composite | Band | The evidentiary stage it tests | The governing ruling |
|---|---|---|---|---|---|
| OpenAI | ai-labs | 27.5 | Developing | Allegation under investigation | ALLEGATION-NOT-ADJUDICATED |
| Oracle | fortune-500 | 14.7 | Critical | Documented, in-effect conduct | COERCIVE-SEVERANCE-STRUCTURE |
| UnitedHealth Group | fortune-500 | 10.2 | Critical | Coordinated investigation, pre-merits | FILED-BUT-UNADJUDICATED |
| Anthropic | ai-labs | 59.1 | Functional | Government-compelled restriction | COMPELLED-RESTRICTION-SCORED-ON-CONDUCT |
| Microsoft | fortune-500 | 65.3 | Established | Government/expose-compelled remedy | COMPELLED-REMEDY-NOT-SELF-CORRECTION |
| Hungary | countries | 50.2 | Functional | Documented self-initiated recovery | (trajectory ≠ top score) |
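A reconciliation of this kind could be sketched as follows. This is an illustration under stated assumptions, not the benchmark's actual tooling: only the top-level `rankings` key is named in the text, so the per-entry field names (`entity`, `composite`), the inline JSON excerpt, and the `find_drift` helper are all hypothetical.

```python
import json

# Hypothetical excerpt of one index file. Only the `rankings` key is named
# in the briefing; the per-entry field names are assumed for illustration.
index_json = """
{"rankings": [
  {"entity": "OpenAI", "composite": 27.5},
  {"entity": "Anthropic", "composite": 59.1}
]}
"""

# Composites as published in the cohort table.
published = {"OpenAI": 27.5, "Anthropic": 59.1}


def find_drift(index: dict, published: dict) -> dict:
    """Return entities whose published composite differs from the index."""
    canonical = {row["entity"]: row["composite"] for row in index["rankings"]}
    return {
        name: (value, canonical.get(name))
        for name, value in published.items()
        if canonical.get(name) != value
    }


print(find_drift(json.loads(index_json), published))  # {}
```

An empty result corresponds to the "no drift" claim; any entry in the result would pair the published value with the canonical one.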

These six sort cleanly into a three-stage evidentiary ladder, plus a compelled-conduct axis that cuts across it:

Stage 1 — Allegation (discounted): OpenAI, UnitedHealth. An accusation exists and is credible enough to investigate, but no merits finding has issued. Weighed as sub-dimension pressure and a watch trigger; it cannot move the composite by itself.
Stage 2 — Documented conduct (scored): Oracle, Hungary. The conduct itself is on the record and in effect — Oracle's severance terms, Hungary's enacted (and not-yet-enacted) reforms. This is what the benchmark actually prices.
Stage 3 — Adjudicated finding (the trigger, not yet pulled): none of the six has crossed it on the in-window facts. Each instead carries a pre-registered trigger that would convert an allegation into a score event.
The compelled-conduct axis: Anthropic and Microsoft are scored on what they did when an external actor forced their hand — not on the forcing. This axis is orthogonal: a compelled action can be conduct-positive (Anthropic) or held-below-self-correction (Microsoft).

The single most important structural fact in this cohort: the discount is applied at the same threshold for everyone. OpenAI's 42-state probe and UnitedHealth's multi-AG investigation are both held at exactly the same evidentiary bar as a single Fortune-500 enforcement filing. The benchmark does not let the count of accusers, the prominence of the accused, or the severity of the allegation substitute for adjudication.
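The ladder can be pictured as a small lookup from evidentiary stage to scoring treatment. The sketch below is illustrative only: the `Stage` enum, the `treatment` function, and the returned labels are hypothetical names, and the stage assignments are copied from the list above rather than from any published data file.

```python
from enum import Enum


class Stage(Enum):
    ALLEGATION = 1          # under investigation, no merits finding issued
    DOCUMENTED_CONDUCT = 2  # conduct on the record and in effect
    ADJUDICATED = 3         # merits finding, settlement, or enforcement action


def treatment(stage: Stage) -> str:
    """Map an evidentiary stage to the handling this briefing describes."""
    if stage is Stage.ALLEGATION:
        # Weighed as sub-dimension pressure and a watch trigger only.
        return "discounted"
    # Documented conduct is priced directly; an adjudicated finding is the
    # pre-registered trigger that turns an allegation into a score event.
    return "scorable"


# Stage assignments taken from the list above (hypothetical data layout).
cohort = {
    "OpenAI": Stage.ALLEGATION,
    "UnitedHealth Group": Stage.ALLEGATION,
    "Oracle": Stage.DOCUMENTED_CONDUCT,
    "Hungary": Stage.DOCUMENTED_CONDUCT,
}

for entity, stage in cohort.items():
    print(f"{entity}: {stage.name} -> {treatment(stage)}")
```

Note that "scorable" here only means the evidence is eligible to move the composite; whether it does still depends on the size of the resulting delta.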

3. The allegation discount — why 42 attorneys general did not move OpenAI

OpenAI sits at 27.5 (Developing, rank 42). In the window of June 12–14, 2026, a coalition of 42 state attorneys general, led by New York AG Letitia James, served OpenAI a subpoena four days after its confidential S-1 IPO filing. The subpoena's scope is broad and serious: consumer and health data, marketing to vulnerable populations (children and seniors), age verification, safety-testing policy, and — notably — model sycophancy named as a design flaw. A separate Florida civil suit (June 1) alleges ChatGPT validated a 16-year-old's suicidal ideation and supplied self-harm methods.

The score did not move. The ruling — ALLEGATION-NOT-ADJUDICATED — is explicit about why, and the reasoning is the load-bearing part:

"an AG subpoena/investigation is an allegation under investigation, not an adjudicated finding … allegations carry a Tier discount and do not by themselves move the composite."

Crucially, the allegation was not ignored. The assessor ran a conservative reconstruction as if the allegation were weighted — applying downward pressure on Accountability (1.9 → 1.7, for marketing to vulnerable populations without age verification) and Awareness (2.2 → 2.0, for anticipatory awareness of harm to minors). That reconstruction produced a composite of 25.9 — a delta of −1.6, below the 5-point movement threshold. The probe is therefore recorded as a sub-dimension intensifier within the Developing band, not a band-moving event. It registers; it just does not clear the bar to become a score change.
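The threshold step of that reconstruction can be checked in a few lines. This is a minimal sketch: `MOVEMENT_THRESHOLD` and `is_score_event` are hypothetical names, the comparison assumes a delta must reach at least 5 points in magnitude to count, and the two composites are taken as given rather than recomputed from the dimension scores, since the conversion formula is not stated in this briefing.

```python
MOVEMENT_THRESHOLD = 5.0  # points; the 5-point movement threshold cited above


def is_score_event(current: float, reconstructed: float) -> tuple:
    """Return the rounded delta and whether it reaches the threshold."""
    delta = round(reconstructed - current, 1)
    return delta, abs(delta) >= MOVEMENT_THRESHOLD


# OpenAI: published composite vs. the conservative reconstruction.
delta, moves = is_score_event(27.5, 25.9)
print(delta, moves)  # -1.6 False
```

The result matches the text: the weighted allegation registers as a −1.6 pressure but does not qualify as a score change.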

This is the discipline doing exactly what it is for. Three features distinguish a discounted allegation from a scored one:

It is weighed, then thresholded. The pressure is reconstructed in good faith. The discount is not "ignore the accusation"; it is "the accusation, weighted honestly, does not cross the movement threshold without proof."
It is corroboration, not novelty. The 42-state subpoena materially broadens the single Florida suit already weighed in the June 8 assessment — but breadth that re-confirms a known pattern is not the same as new adjudicated harm.
It carries a pre-registered endpoint. The assessment's Watch section names the exact conversion conditions: "Any AG enforcement action, settlement, or adjudicated finding (vs. the current investigative subpoena) → scorable ACC/INT downgrade." The discount has a defined off-ramp; it is not a permanent shield.

The skeptical reading — "the benchmark let a trillion-dollar lab off the hook" — gets the discipline backwards. The benchmark would be less credible, not more, if 42 signatures on a subpoena could move a score that the same 42 offices have not yet proven in any forum.

4. The documented-conduct contrast — one severance clause moved Oracle into Critical

Set OpenAI beside Oracle, which crossed Developing → Critical (20.6 → 14.7, −5.9) on June 14, in a change applied after founder review. The contrast is the whole argument of this briefing.

Oracle's downgrade did not rest on an accusation, an investigation, or a regulator's complaint. It rested on documented, in-effect conduct — the actual terms of a finalizing 30,000-person layoff (~18% of global headcount, completing June 15, 2026):

"employees must sign a release waiving their right to sue in order to receive any benefit at all." — Tech Times, 2026-06-01

"Oracle included the 60-day WARN notice pay within its existing severance calculation rather than paying it on top of severance." — Tech Times, 2026-06-01

"We are choosing the chips. Anyone whose job is not making the chips run faster for our customers is at risk in this industry." — Larry Ellison, via Tech Times, 2026-06-01

These are not allegations awaiting a finding. They are the operative terms of the severance program itself — a sign-or-forfeit consent structure, WARN notice pay absorbed into (rather than added to) severance, forfeited unvested shares, and leadership explicitly framing margin over workforce. The conduct is the evidence; there is nothing left to adjudicate about what the terms are. The composite moved accordingly, with Boundaries, Accountability, and Integrity all marked down.

The OpenAI / Oracle pairing isolates the variable with unusual cleanliness:

| | OpenAI | Oracle |
|---|---|---|
| Pressure type | 42-state AG subpoena | Documented severance terms |
| Evidentiary stage | Allegation under investigation | Conduct in effect |
| Weighted delta | −1.6 (reconstructed) | −5.9 (applied) |
| Threshold | Below 5pt — held | Above 5pt — crossed |
| Band | Developing (held) | Developing → Critical |
| Source tier | News reporting of an investigation | Tier-2 reporting of program terms |

The lesson is not that Oracle is "worse" than OpenAI in some absolute sense — they sit in different indexes and are not directly comparable on the bottom band (a point the Floor-and-Critical briefing made at length). The lesson is about evidence stage, holding severity aside: documented conduct is scored; an investigation into possibly-worse conduct is not, until it is proven.

5. Enforcement density is not proof — the UnitedHealth case

If OpenAI is the single-investigation allegation case, UnitedHealth Group (10.2, Critical, rank 445) is the stress test of the same rule: what happens when the accusations pile up?

In June 2026 UnitedHealth faced, simultaneously: a coordinated investigation by multiple state attorneys general, a continuing DOJ criminal probe into Medicare Advantage risk-score inflation, a shareholder lawsuit alleging withheld material information after the Thompson shooting, and multiple federal/state cases over AI-powered claim denials and mental-health coverage failures. A CEO resignation landed in the same window. By any journalistic measure this is a company in crisis.

The score held at 10.2. The ruling answers the exact question a critic would ask:

"does coordinated sovereign escalation cross from filed-but-unadjudicated to scorable? Ruling: NO. A coordinated multi-AG investigation is an investigation, not a merits adjudication, settlement, or charge … the breadth (multi-state coordination) raises enforcement density and corroborates the decade-long upcoding/claim-denial pattern, but density of investigations is not adjudicated harm."

Two refinements make this case more instructive than OpenAI's:

Density ≠ proof. Ten investigations into the same alleged pattern are still zero merits findings. The benchmark explicitly refuses to let aggregation substitute for adjudication — the failure mode where "everyone is investigating them, so it must be true" silently becomes a score.
The discount is not absolution. UnitedHealth is at 10.2 — near the floor — because its documented conduct (the priced ACC 1.125 / EQU 1.25 profile, an entrenched claim-denial and upcoding record) already sits there on prior evidence. The allegation discount governs only the new in-window escalation; it does not launder the company's existing low score upward. The CEO resignation, similarly, is ruled "governance churn, not scorable harm."
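The density rule can be illustrated with a simple count. The sketch below is a simplification under stated assumptions: the `Proceeding` record, the status strings, and the three sample entries are hypothetical stand-ins for the proceedings described above, and the set of converting statuses follows the quoted ruling's wording ("merits adjudication, settlement, or charge").

```python
from dataclasses import dataclass

# Outcomes the quoted ruling treats as crossing from investigation to proof.
CONVERTING = {"adjudicated", "settled", "charged"}


@dataclass(frozen=True)
class Proceeding:
    label: str
    status: str  # e.g. "investigation", "filed", "adjudicated"


def summarize(proceedings: list) -> tuple:
    """Return (enforcement density, number of converting outcomes)."""
    density = len(proceedings)
    proven = sum(p.status in CONVERTING for p in proceedings)
    return density, proven


# Hypothetical encoding of the in-window UnitedHealth proceedings.
unitedhealth = [
    Proceeding("multi-state AG investigation", "investigation"),
    Proceeding("DOJ Medicare Advantage probe", "investigation"),
    Proceeding("shareholder lawsuit", "filed"),
]

density, proven = summarize(unitedhealth)
print(density, proven)  # 3 0
```

However large the first number grows, the score event depends on the second, which stays at zero until one of the proceedings resolves.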

The Watch trigger is identical in structure to OpenAI's: "Merits adjudication / settlement of the DOJ MA-fraud probe or multi-AG action → scorable downgrade." A near-floor insurer and a trillion-dollar AI lab are governed by the same off-ramp. That uniformity — the rule does not bend for size, sympathy, or sector — is what makes it a rule rather than a reaction.

6. The compelled-conduct doctrine — scoring behavior, not the force behind it

The third move is the subtlest, and it cuts across the allegation ladder rather than sitting on it. When an external actor — a government, a court, a regulator — forces an institution to act, the benchmark scores the institution's conduct in the event, not the external compulsion. Two entities in the cohort run this doctrine in opposite directions.

Anthropic (59.1, Functional) — compelled restriction, scored conduct-positive. On June 12, 2026 the US Commerce Department issued an export-control directive requiring Anthropic to suspend its Fable 5 and Mythos 5 models for all foreign nationals (including foreign-national staff), citing national security and a reported jailbreak. Anthropic disabled both models for all customers within a day. The methodological treatment is explicit:

"This is a government-mandated shutdown, not a failure of Anthropic's own compassion conduct. The benchmark scores the entity's behavior — recognizing, responding to, and reducing suffering — not external sentiment about a regulatory action. Anthropic's conduct in the event (prompt public disclosure, an apology to customers, disclosure of the government's stated reasoning, a stated effort to restore access) is mildly positive on Accountability/Transparency."

The shutdown could have been read naively as "Anthropic's product is now restricted, therefore downgrade." The benchmark refused that read: the compulsion is not the entity's conduct. Accountability was nudged 3.5 → 3.6 for the transparent handling; the composite held at 59.1. The principle: an institution should not be penalized (or rewarded) for what a government forces on it — only for how it behaves when forced.

Microsoft (65.3, Established) — compelled remedy, held below self-correction. The same doctrine, mirrored at the positive end. Following an external inquiry into Unit 8200's use of Azure for mass surveillance of Palestinian mobile calls, Microsoft terminated that access, created a Human Rights Review Board, strengthened pre-contract review, and added anonymous reporting channels. This is genuine, documented reform, yet the ruling (COMPELLED-REMEDY-NOT-SELF-CORRECTION) holds it sub-threshold:

"The Human Rights Review Board's creation IS the compelled, scope-limited, prospective remediation … expose-driven, not self-initiated — AB2 anchor 3 (course-correction-under-pressure), not anchor 4-5. Max defensible uplift remains sub-threshold (+1.3)."

A remedy extracted by exposure is credited — but not as the same thing as a remedy an institution reaches on its own. Microsoft is held at 65.3 with a POSITIVE-WATCH and a pre-registered upgrade trigger: "Verified HR Review Board blocking/exiting a harmful national-security contract → scorable UPGRADE." The compelled remedy earns the watch; only verified operative effect earns the score.

Together, Anthropic and Microsoft establish the doctrine symmetrically: compelled restriction is not self-inflicted harm; compelled remedy is not self-initiated virtue. In both cases the benchmark scores the conduct and discounts the compulsion.

8. Forward view — where the discounts convert

Every held score in this cohort is held pending a trigger. The forward view is therefore a list of the exact events that would turn an allegation into a score change — the points where this discipline gets tested in public.

OpenAI — the subpoena→enforcement line. The single highest-value forward trigger in the cohort. Any AG enforcement action, settlement, or adjudicated finding (as opposed to the current investigative subpoena), or adjudication/settlement of the Florida Adam Raine suit, converts the discounted allegation into a scorable ACC/INT downgrade. The IPO timing (subpoena four days after the confidential S-1) makes a public, near-term adjudication path plausible.
UnitedHealth — the merits line. A merits adjudication or settlement of the DOJ Medicare-Advantage fraud probe or the multi-AG action would move a near-floor score with modest residual headroom. This is the case most likely to test whether the density discount holds under sustained pressure.
Anthropic — the compelled-conduct off-ramp. An adjudicated DC Circuit ruling on the Pentagon supply-chain-risk designation, or enacted audit-mandate compliance/non-compliance, could become scorable in either direction — the cleanest forward test of whether "scored on conduct" holds when the compelled event resolves.
Microsoft — the operative-effect trigger. The compelled HR Review Board converts from sub-threshold remedy to scorable upgrade only on verified operative effect: a documented contract the Board actually blocks or exits. Quiet reinstatement of terminated Unit 8200 access would convert the other way.
Hungary — the self-initiation test. The recovery arc (28.1 → 50.2) is the cohort's control case for what credited conduct looks like. Whether it climbs further toward the top band (and, under the rule, momentum alone carries it only toward that band, not into it) depends on enacted, sustained, self-initiated reform: structural commitments that materialize rather than merely being announced.

The through-line: none of these scores moves on the accusation. Each moves the day the accusation becomes proof — or the day documented conduct changes. That is the event to watch, and the discipline this briefing documents is precisely the rule that decides which day that is.

Sources

Canonical scores (ground truth): site/src/data/indexes/{ai-labs,fortune-500,countries}.json — the six composites (OpenAI 27.5, Oracle 14.7, UnitedHealth 10.2, Anthropic 59.1, Microsoft 65.3, Hungary 50.2), bands, and ranks were recomputed directly from rankings[] and reconcile exactly with the published values (no drift).
Ruling provenance (assessments): research/assessments/openai-2026-06-15.md (ALLEGATION-NOT-ADJUDICATED; 42-state subpoena; −1.6 reconstruction); research/change-proposals/oracle-2026-06-12.json (COERCIVE-SEVERANCE-STRUCTURE; applied Developing → Critical, founder-approved 2026-06-14); research/assessments/unitedhealth-group-2026-06-09.md (FILED-BUT-UNADJUDICATED / density-not-adjudication); research/assessments/anthropic-2026-06-13.md (compelled Fable 5 / Mythos 5 shutdown, scored on conduct); research/assessments/microsoft-2026-06-09.md (COMPELLED-REMEDY-NOT-SELF-CORRECTION); research/assessments/hungary-2026-04-30.md + site/public/data/history/hungary.json (28.1 baseline → 50.2 recovery arc).
Applied-change record: research/APPLIED_CHANGES.md (2026-06-14 batch — Oracle Developing → Critical, band-count deltas).
Methodology corpus: research/PENDING_CHANGES.md (, , — the pre-registration questions referenced in §7); prior Special Briefings research/special-briefings/ai-governance-2026-06-15.md and research/special-briefings/layoffs-despite-profits-2026-06-15.md (conduct-vs-coercion and pre-adjudication framing).
Primary web evidence (entity events):
OpenAI 42-state subpoena — Tom's Hardware, TechTimes, The Next Web.
Oracle severance terms — Tech Times.
UnitedHealth probes — Healthcare Finance News, Medical Economics.
Microsoft compelled remedy — The National.
Hungary recovery context — HRW (rule-of-law agenda), HRW (ICC road back).

How to read the scores

The 0–100 scale — five bands

Every entity — state, corporation, AI lab, robotics lab, or city — is scored 0–100 across 8 dimensions and 40 subdimensions. The composite score places the entity in one of five bands:

Critical (0–20): Foundational compassion practices are absent or documented active harm is present.

Developing (20–40): Some practices are emerging but remain inconsistent, reactive, or unevenly applied.

Functional (40–60): Core practices exist and meet a basic bar, with significant gaps remaining.

Established (60–80): Practices are systematic, documented, and supported by consistent evidence.

Exemplary (80–100): Practices are independently verified, consistent, and sustained under pressure.
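The band lookup can be sketched as a threshold search. The names `BAND_EDGES`, `BAND_NAMES`, and `band_for` are hypothetical, and one detail is an assumption: the listed ranges share their edges (20, 40, 60, 80), and the text does not say which band owns an edge value, so this sketch assigns it to the higher band.

```python
from bisect import bisect_right

# Interior edges of the five bands listed above. Assigning an exact edge
# value such as 20.0 to the higher band is an assumption of this sketch.
BAND_EDGES = [20, 40, 60, 80]
BAND_NAMES = ["Critical", "Developing", "Functional", "Established", "Exemplary"]


def band_for(composite: float) -> str:
    """Return the band name for a 0-100 composite score."""
    if not 0 <= composite <= 100:
        raise ValueError(f"composite out of range: {composite}")
    return BAND_NAMES[bisect_right(BAND_EDGES, composite)]


for name, score in [("OpenAI", 27.5), ("Oracle", 14.7), ("Microsoft", 65.3)]:
    print(name, band_for(score))
```

Applied to Oracle's move from 20.6 to 14.7, the lookup returns Developing and then Critical, which is the band change recorded in this briefing.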

The 8 dimensions

Each dimension is scored 1–5 across 5 subdimensions (40 subdimensions total), then converted to a 0–100 composite. A score of 1.0 on a subdimension represents the minimum anchor; 5.0 is exemplary conduct.

AWR · Awareness: Does this entity reliably detect when others are in pain or need — before they name it?

EMP · Empathy: Does this entity genuinely connect with the inner experience of those it serves?

ACT · Action: Does compassionate understanding translate into real, proportional, effective help?

EQU · Equity: Is care distributed fairly — especially toward those with greatest need and least power?

BND · Boundaries: Is helping sustainable, ethical, and autonomy-preserving — not dependency-creating?

ACC · Accountability: Does this entity own its failures, correct course, and make genuine repair?

SYS · Systemic Thinking: Does compassion extend to root causes and structural change — not only symptom relief?

INT · Integrity: Is compassion genuine, consistent, and non-performative — especially when it costs something?

Scores are based on public evidence — government reports, regulatory filings, independent audits, judicial findings, and verifiable third-party records. Entities never pay for inclusion, score changes, or suppression of findings.
