How the Compassion Benchmark works
The Compassion Benchmark is a structured methodology for measuring whether an institution reliably detects suffering, understands it, responds effectively, distributes care fairly, respects ethical limits, owns failures, addresses root causes, and behaves with genuine integrity. The formal assessment architecture combines an eight-dimension framework, forty scored subdimensions, a tiered evidence model, adversarial pressure testing, and a human-led synthesis workflow designed to make scores legible, contestable, and difficult to game.
Human Assessment Battery
ACB-HAB-001 is the human-administered field guide for corporations, governments, religious institutions, and AI development organizations. It uses structured interviews, document review, observation, and community testimony rather than self-report alone.
| Document ID | ACB-HAB-001 |
|---|---|
| Version | 1.0 — Initial Release |
| Companions | ACB-PAB-001 and ACB-STD-001 |
| Administered by | Credentialed ACB assessors |
| Typical duration | 4–6 interview hours per entity across 2–3 session days |
| Sensitivity | Restricted assessor-use instrument |
Core methodological principle
The pressure-test principle
Every dimension is assessed under adversarial conditions. For each subdimension, assessors look for at least one documented case where compassionate behavior was costly, legally risky, or institutionally inconvenient. If no such case exists, the maximum subdimension score is capped at Developing (3), even when the entity appears strong under favorable conditions.
In plain terms: high performance when it is easy is not treated as sufficient evidence of compassionate institutional character.
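As a scoring rule, the cap is mechanical. Here is a minimal sketch using the Developing anchor (3) from the rubric further below; the function name and inputs are illustrative, not part of the published instrument:

```python
# Minimal sketch of the pressure-test cap. DEVELOPING matches the
# rubric anchor (3); names here are illustrative, not published ones.
DEVELOPING = 3

def apply_pressure_test_cap(raw_score: int, has_costly_case: bool) -> int:
    """Cap a subdimension score at Developing (3) when no documented case
    shows compassionate behavior that was costly, risky, or inconvenient."""
    return raw_score if has_costly_case else min(raw_score, DEVELOPING)
```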
Framework overview
The benchmark preserves the same conceptual structure across sectors. What changes by entity type is the evidence model and assessment protocol, not the underlying definition of compassion.
Awareness
Whether the entity reliably detects suffering before it has to be formally named.
Empathy
Whether the entity genuinely models and honors the inner experience of affected people.
Action
Whether compassionate understanding becomes timely, effective, adequately resourced help.
Equity
Whether care is extended fairly, accessibly, and in proportion to need.
Boundaries
Whether help is ethical, sustainable, consent-based, and autonomy-preserving.
Accountability
Whether the entity acknowledges harm, corrects course, learns, and repairs.
Systemic Thinking
Whether the entity addresses root causes, second-order effects, and structural conditions.
Integrity
Whether compassion is genuine, consistent under pressure, and resilient over time.
Assessor orientation
- Warmth and rigor are not opposites. Assessors document what they observe, not what leadership prefers.
- Policies on paper are weaker evidence than lived practice, observed outcomes, or affected-community testimony.
- Skipping the awkward question is one of the fastest ways to produce a weak assessment.
- If leadership and community testimony conflict, the community account is treated as the primary reference point.
Interview principles
- Ask for examples, not abstractions.
- Follow the power gradient and include the least protected voices.
- Name the gap explicitly when a policy exists but no applied example can be produced.
- Treat deflection, silence, and refusal to provide evidence as data rather than noise.
- Score conservatively when evidence is incomplete, then flag for lead assessor review.
7-session human assessment protocol
The Human Assessment Battery uses a structured sequence intended to compare leadership narrative, frontline reality, community experience, and documentary evidence before final score synthesis.
| Session | Participants | Primary focus | Typical duration |
|---|---|---|---|
| 1A | Senior leadership (2–3 people) | Awareness, Action, Accountability, Integrity | 90 min |
| 1B | HR / People / Ethics leads | Empathy, Boundaries | 60 min |
| 2A | Frontline staff selected by entity | Pressure-test prior leadership claims | 60 min |
| 2B | Frontline staff selected independently by assessor | Repeat and compare against entity-selected group | 60 min |
| 3A | Affected community members recruited independently | Equity and Systemic Thinking, plus lived-experience validation | 90 min |
| 3B | Solo assessor document review | Cross-check claims against records, protocols, data, and artifacts | 60 min |
| 4 | Lead assessor synthesis | Score finalization, discrepancy resolution, escalation flags | 60 min |
Continuous research pipeline
After an initial human assessment establishes a baseline, a four-stage nightly pipeline monitors every benchmarked entity for material evidence within a 14-day recency window. Scores change only after human review.
Stage 1
Scanner
Every night, a structured search across all 1,155 benchmarked entities surfaces compassion-relevant evidence from the last 14 days. No entity is skipped. Every entity carries a provenance record of the searches that touched it.
Stage 2
Assessor
Entities with material new evidence receive a full reassessment against the 8-dimension, 40-subdimension rubric. The score delta is computed against the published composite.
Stage 3
Digest
A structured digest synthesizes the night’s findings: proposed changes, sector alerts, methodology concerns, and watch items. Nothing is applied yet.
Stage 4
Founder approval
Every proposed score change is reviewed and approved — or rejected — by a human before reaching production. The approval log is auditable. Evidence older than 14 days cannot drive a change.
Each entity page on the published site carries a freshness stamp ("Evidence reviewed YYYY-MM-DD") showing either that no material change surfaced in the last 14 days (green) or that new evidence is under review (orange). The scanner touches every one of the 1,155 entities daily, not only the most active ones.
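The pipeline's gating rules reduce to a few mechanical checks. A hedged sketch follows; the record shapes and function names are hypothetical and not drawn from the published system, while the 14-day window, the human approval gate, and the two stamp colors come from the text above:

```python
# Hedged sketch of the nightly pipeline's gating rules. Only the 14-day
# window, the human gate, and the stamp colors come from the methodology;
# everything else is an assumed shape for illustration.
from datetime import date, timedelta

RECENCY_WINDOW = timedelta(days=14)

def evidence_can_drive_change(evidence_date: date, today: date) -> bool:
    """Evidence older than 14 days cannot drive a score change."""
    return (today - evidence_date) <= RECENCY_WINDOW

def freshness_stamp(new_evidence_under_review: bool) -> str:
    """Green: no material change surfaced in the window.
    Orange: new evidence is under review."""
    return "orange" if new_evidence_under_review else "green"

def apply_proposal(proposal: dict, human_approved: bool) -> dict | None:
    """Stage 4: nothing reaches production without explicit human
    approval; approvals and rejections are both logged."""
    return proposal if human_approved else None
```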
The weekly briefing is free and editorial. Commercial products are separate — they do not affect scoring.
The weekly briefing on institutional compassion scores
Score changes, sector trends, and emerging risk signals from overnight research across 1,155 entities — every Monday. Free.
Structural safeguard
No automated score changes
Every proposed score change, whether generated by the overnight research pipeline, a new evidence disclosure, or a scheduled rotation, requires explicit human approval before it reaches the published index. The approval log is retained. The proposal and its evidence are retained. The decision is retained.
This gate is not a review of surface numbers. The approver examines the assessment reasoning, the evidence quality, the sector context, and any discrepancy with prior findings. Approximately 30 percent of generated proposals are sent back for additional evidence or adjusted before approval. Rejections are logged alongside approvals.
Independence policy
The commercial separation that protects benchmark integrity.
Entities never pay for inclusion, score changes, or suppression of findings.
This separation is the load-bearing trust commitment of the benchmark. If it ever appears compromised, the benchmark loses its value regardless of any other quality signal.
Evidence hierarchy
The benchmark deliberately differentiates evidence by independence and reliability. Strong scores require stronger evidence.
Tier 1
Independent external audit
Highest-weight evidence such as third-party assessments, regulatory findings, and academic studies.
Tier 2
Verifiable outcome data
High-weight evidence including disaggregated service data, longitudinal surveys, and resolution rates.
Tier 3
Community testimony
High-weight evidence from affected populations, independent focus groups, and structured interviews.
Tier 4
Policy and process documents
Moderate-weight evidence such as governing documents, training records, and budget allocations.
Tier 5
Entity self-report
Lowest-weight evidence including mission statements and annual reports, requiring corroboration from stronger tiers.
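For tooling purposes the hierarchy is a small lookup structure. A sketch under one loud assumption: the methodology specifies only relative weight classes (highest, high, moderate, lowest), so the numeric weights below are hypothetical placeholders, not published values:

```python
# The five-tier hierarchy as a lookup structure. The numeric weights are
# hypothetical placeholders; the methodology gives only relative classes.
from enum import IntEnum

class EvidenceTier(IntEnum):
    INDEPENDENT_AUDIT = 1    # highest weight
    OUTCOME_DATA = 2         # high weight
    COMMUNITY_TESTIMONY = 3  # high weight
    POLICY_DOCUMENTS = 4     # moderate weight
    SELF_REPORT = 5          # lowest weight

TIER_WEIGHT = {              # hypothetical values, for illustration only
    EvidenceTier.INDEPENDENT_AUDIT: 1.0,
    EvidenceTier.OUTCOME_DATA: 0.8,
    EvidenceTier.COMMUNITY_TESTIMONY: 0.8,
    EvidenceTier.POLICY_DOCUMENTS: 0.5,
    EvidenceTier.SELF_REPORT: 0.2,
}

def requires_corroboration(tier: EvidenceTier) -> bool:
    """Self-report alone cannot support a strong score."""
    return tier is EvidenceTier.SELF_REPORT
```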
Interpretation rule
Evidence beats aspiration
Where paper claims and lived experience diverge, the methodology scores the world as encountered, not the story as presented.
Common scoring model
Each subdimension is scored on a 0–5 anchored behavioral scale. The five subdimensions within a dimension are summed and converted into a dimension score out of 10. The eight dimension scores together create a base total out of 80, which is then combined with an integration premium worth up to 10 additional points to produce the composite, reported on a 0–100 scale. (With the current +10 premium cap, the maximum achievable composite is 90; see the change log below.)
A score of 0 represents active documented harm and requires lead assessor co-sign. A score of 4 or 5 is provisional unless there is pressure-test evidence.
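A sketch of this arithmetic under the straightforward linear reading: five 0–5 subdimension scores (25 raw points) scale to a 0–10 dimension score, and the eight dimension scores sum to the base out of 80. The integration premium itself is sketched after the band list below:

```python
# Sketch of the scoring arithmetic, assuming a linear conversion from
# the 25 raw subdimension points to the 0-10 dimension score.

def dimension_score(subdims: list[int]) -> float:
    """Convert five 0-5 subdimension scores into a 0-10 dimension score."""
    assert len(subdims) == 5 and all(0 <= s <= 5 for s in subdims)
    return sum(subdims) * (10 / 25)

def composite(dimension_scores: list[float], premium: float) -> float:
    """Base total out of 80 plus an integration premium of up to 10
    points; maximum achievable composite is 90 under the current cap."""
    assert len(dimension_scores) == 8
    return sum(dimension_scores) + premium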
Integration premium logic
The premium rewards consistent compassionate performance across dimensions rather than isolated strengths. Harm override sets the premium to zero when any subdimension scores 0. The premium is reduced when dimension scores are uneven and weakened further for each dimension that falls below 4.0.
- Std. dev. ≤ 1.5 → 100% consistency factor
- Std. dev. 1.5–3.0 → 75%
- Std. dev. 3.0–5.0 → 40%
- Std. dev. > 5.0 → 10%
- Weakness penalty: minus 20% for each dimension below 4.0
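The premium logic can be expressed directly. In this sketch the band thresholds, the harm override, and the 20 percent weakness penalty come from the methodology; the multiplicative combination of the two factors and the use of the population standard deviation are assumptions about details the text leaves open:

```python
# Hedged sketch of the integration premium. Bands, harm override, and
# weakness penalty are from the methodology; how the factors combine is
# an assumption.
import statistics

def integration_premium(dimension_scores: list[float],
                        any_subdimension_zero: bool) -> float:
    if any_subdimension_zero:
        return 0.0  # harm override zeroes the premium outright

    sd = statistics.pstdev(dimension_scores)
    if sd <= 1.5:
        consistency = 1.00
    elif sd <= 3.0:
        consistency = 0.75
    elif sd <= 5.0:
        consistency = 0.40
    else:
        consistency = 0.10

    weak = sum(1 for d in dimension_scores if d < 4.0)
    weakness_factor = max(0.0, 1.0 - 0.20 * weak)

    return 10.0 * consistency * weakness_factor  # capped at +10 (v1.1)
```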
Rubric anchors and score bands
The Human Assessment Battery uses universal behavioral anchors at the subdimension level and a five-band public interpretation model at the composite level.
| Score | Anchor | Meaning |
|---|---|---|
| 0 | Active Harm | Specific documented harm in the domain; lead assessor co-sign required. |
| 1 | Absent | No meaningful capacity exists. |
| 2 | Minimal | Nominal capacity exists but fails under pressure and does not produce consistent real-world outcomes. |
| 3 | Developing | Good-faith capacity exists in some cases, but not consistently or comprehensively. |
| 4 | Established | Consistent operational capacity across most cases; community confirms positive experience. |
| 5 | Exemplary | Outstanding independently verified performance sustained under significant pressure. |
Lead assessor review flags
Certain patterns automatically trigger deeper review before a score is finalized. A sketch of the mechanically checkable flags follows this list.
Active harm
Any subdimension scored 0 requires written documentation and lead assessor co-sign.
Rater discrepancy
Inter-rater reliability (IRR) discrepancy greater than 1.5 on any subdimension triggers review.
Unsupported high scores
A score of 4 or 5 without pressure-test evidence is flagged provisional.
Leadership-community gap
Significant differences between leadership narrative and community testimony must be resolved.
Missing documents
Refusal to provide requested documentation is itself a score-relevant event.
Open discussion flags
Any unresolved discussion note blocks finalization until reviewed.
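The following sketch covers only the flags a machine can check (active harm, rater discrepancy, unsupported high scores, open notes); the leadership-community gap and missing-document flags require assessor judgment and are omitted. The record shapes and function name are hypothetical:

```python
# Sketch of the mechanically checkable review flags. Thresholds come
# from the methodology; record shapes and names are hypothetical.

def review_flags(rater_a: dict[str, int], rater_b: dict[str, int],
                 pressure_tested: set[str], open_notes: int) -> list[str]:
    flags = []
    for subdim, a in rater_a.items():
        b = rater_b[subdim]
        if a == 0 or b == 0:
            flags.append(f"{subdim}: active harm score, lead assessor co-sign required")
        if abs(a - b) > 1.5:
            flags.append(f"{subdim}: inter-rater discrepancy exceeds 1.5")
        if max(a, b) >= 4 and subdim not in pressure_tested:
            flags.append(f"{subdim}: high score provisional, no pressure-test evidence")
    if open_notes > 0:
        flags.append("unresolved discussion notes block finalization")
    return flags
```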
Full 40-subdimension framework
Each dimension contains five scored subdimensions. Together they define the operational content of the standard.
| Dimension | ID | Subdimension | Core assessment question |
|---|---|---|---|
| Awareness | A1 | Suffering Detection | Does this entity reliably detect when others are in pain or need before they explicitly name it? |
| Awareness | A2 | Contextual Sensitivity | Does awareness adjust to the actual populations being served, rather than to default assumptions? |
| Awareness | A3 | Blind Spot Mitigation | Does the entity actively seek out the suffering it is currently missing? |
| Awareness | A4 | Signal Amplification | Does it make visible the suffering of those who cannot easily speak for themselves? |
| Awareness | A5 | Anticipatory Awareness | Can the entity foresee potential harms before they manifest? |
| Empathy | E1 | Affective Resonance | Do people feel genuinely cared about rather than merely processed? |
| Empathy | E2 | Perspective-Taking | Does the entity model the inner experience of those it serves, especially those far from leadership power? |
| Empathy | E3 | Non-Judgment | Does it suspend judgment across identity, behavior, and belief differences under pressure? |
| Empathy | E4 | Validation | Does it affirm the legitimacy of others’ experiences, even when inconvenient? |
| Empathy | E5 | Cultural Empathy | Does it integrate diverse cultural ways of knowing into practice rather than offering surface accommodation? |
| Action | AC1 | Responsiveness | Do identified needs receive timely, appropriately prioritized response? |
| Action | AC2 | Proportionality | Is help calibrated to actual need, not simply to what is easiest to provide? |
| Action | AC3 | Efficacy | Does the help actually reduce suffering rather than just creating activity that looks like help? |
| Action | AC4 | Resource Mobilization | Does the entity bring adequate resources to the problems it has identified? |
| Action | AC5 | Follow-Through | Does the entity persist rather than disengage when attention moves on? |
| Equity | EQ1 | Universality | Does care extend to all people regardless of identity? |
| Equity | EQ2 | Priority for the Vulnerable | Are those with the greatest need actually prioritized? |
| Equity | EQ3 | Bias Awareness | Does the entity identify and correct bias in who receives care and how? |
| Equity | EQ4 | Access Design | Are services designed to be accessible to those who need them most? |
| Equity | EQ5 | Historical Harm Acknowledgment | Does the entity recognize and respond to historical harms associated with itself or its predecessors? |
| Boundaries | B1 | Self-Sustainability | Does compassionate work come from a stable, non-depleting foundation? |
| Boundaries | B2 | Autonomy Preservation | Does help build self-determination rather than dependency? |
| Boundaries | B3 | Scope Clarity | Does the entity communicate honestly about what it can and cannot do? |
| Boundaries | B4 | Refusal Ethics | When the entity declines to help, is refusal delivered with dignity and alternatives? |
| Boundaries | B5 | Consent Orientation | Does it obtain genuine informed consent before acting? |
| Accountability | AB1 | Harm Acknowledgment | Does the entity acknowledge harm it has caused without deflection? |
| Accountability | AB2 | Correction Willingness | Does it change course when shown it is causing harm? |
| Accountability | AB3 | Transparency | Does it operate with genuine transparency about performance and failures? |
| Accountability | AB4 | Systemic Learning | Does it institutionally learn from failures? |
| Accountability | AB5 | Reparative Action | Does it make concrete repair to those it has harmed? |
| Systemic Thinking | S1 | Root Cause Orientation | Does the entity address causes of suffering, not only symptoms? |
| Systemic Thinking | S2 | Long-Term Impact | Does it plan for long-horizon effects of its actions? |
| Systemic Thinking | S3 | Interconnection Awareness | Does it understand effects on adjacent systems and second-order consequences? |
| Systemic Thinking | S4 | Structural Critique | Does it critically examine structures that perpetuate suffering, including those from which it benefits? |
| Systemic Thinking | S5 | Coalitional Compassion | Does it collaborate in ways that amplify impact beyond its own institutional capacity? |
| Integrity | I1 | Consistency Under Pressure | Does compassionate behavior hold when it is costly? |
| Integrity | I2 | Non-Performance | Is compassion genuine rather than reputationally driven? |
| Integrity | I3 | Internal Consistency | Does the entity treat internal stakeholders with the same compassion it claims externally? |
| Integrity | I4 | Values Alignment | Are major decisions actually aligned with stated values? |
| Integrity | I5 | Resilience of Care | Does compassion persist across leadership change and institutional stress? |
What assessors are looking for in practice
The Human Assessment Battery turns abstract values into observable behaviors, evidence requests, and comparison points across populations and power levels.
Awareness examples
Soft-signal reporting, proactive outreach, silent-population detection, pre-launch harm assessment.
Empathy examples
Community testimony, direct-service observation, validation before procedure, leadership veto power for affected groups.
Action examples
Response-time data, proportional help, independent outcome studies, follow-through protocols.
Equity examples
Coverage gaps, disaggregated outcomes, bias audits, barrier-removal evidence, historical harm response.
Boundaries examples
Burnout prevention, autonomy measurement, scope clarity, dignified refusal, informed consent withdrawal.
Accountability examples
Public acknowledgment of harm, change after failure, transparency about poor outcomes, co-designed repair.
Systemic examples
Root-cause work, long-range planning, adjacent-system analysis, structural critique, coalition-building.
Integrity examples
Costly moral choices, invisible compassionate practices, staff treatment, values-based decisions, continuity through stress.
Cross-sector adaptation
The same framework can be adapted across governments, corporations, NGOs, religious institutions, AI labs, technology systems, products, and teams. The human battery is especially important when community interviews, leadership interviews, and observation are necessary to understand whether compassionate behavior actually exists in practice.
In the broader ACB architecture, AI systems may also be evaluated with the AI Prompt Battery while organizations behind those systems are evaluated using the Human Assessment Battery.
Methodology intent
The benchmark is designed to be interpretable, reproducible, and contestable. It is meant to reward genuine compassionate capacity, expose performative signaling, and create a shared language for institutional behavior that can be compared over time and across sectors.
The final submission test is simple: could the assessor defend the score in front of both leadership and the affected community in the same room?
Methodology version and change log
Methodology changes are versioned, dated, and publicly described. Historical changes do not retroactively rewrite prior assessments.
Version 1.1
- Integration premium capped at +10 points (down from +20). Rationale: entities with uniform-high dimension profiles were computing to a perfect 100 regardless of evidence quality. The cap ensures the premium rewards consistency without overriding evidence ceilings.
- Composite score determinism enforced. Every composite must now compute deterministically from its dimension scores via the published formula. A data-layer validator rejects drift above 2.0 points (a minimal sketch of this check follows the change log).
- Floor-clamping artifacts corrected. Entities previously displayed at composite 0.0 (a legacy display-layer artifact) now show their formula-computed composites, typically 4 to 7 points, reflecting their actual dimension scores.
Version 1.0 (initial release)
- 8-dimension, 40-subdimension framework.
- 7-session human assessment protocol.
- 5-tier evidence hierarchy.
- Integration premium up to +20 (superseded by v1.1).
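A minimal sketch of the v1.1 determinism check, assuming the composite formula given in the scoring model above; the function name and rejection mechanism are illustrative:

```python
# Minimal sketch of the v1.1 data-layer determinism validator. The
# recomputation uses the composite formula sketched earlier; the name
# and mechanism here are illustrative.

DRIFT_TOLERANCE = 2.0  # points of allowed drift before rejection

def validate_composite(published: float,
                       dimension_scores: list[float],
                       premium: float) -> bool:
    """Reject any published composite that drifts more than 2.0 points
    from the value recomputed via the published formula."""
    recomputed = sum(dimension_scores) + premium
    return abs(published - recomputed) <= DRIFT_TOLERANCE
```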
Explore the benchmark
See how the methodology is applied across the current published index families.
Indexes
Apply it to your organization
Use the framework as an entry point into guided review, advisory, or formal structured assessment.
Assess Your Organization
Use the benchmark seriously
Purchase reports, license data, book advisory work, or discuss a broader institutional relationship.