Compassion Benchmark
Methodology & Framework

How the Compassion Benchmark works

The Compassion Benchmark is a structured methodology for measuring whether an institution reliably detects suffering, understands it, responds effectively, distributes care fairly, respects ethical limits, owns failures, addresses root causes, and behaves with genuine integrity. The formal assessment architecture combines an eight-dimension framework, forty scored subdimensions, a tiered evidence model, adversarial pressure testing, and a human-led synthesis workflow designed to make scores legible, contestable, and difficult to game.

8 core dimensions
40 scored subdimensions
7 human assessment sessions
5 evidence tiers
0–100 composite score scale
v1.1 methodology version

Human Assessment Battery

ACB-HAB-001 is the human-administered field guide for corporations, governments, religious institutions, and AI development organizations. It uses structured interviews, document review, observation, and community testimony rather than self-report alone.

Document ID: ACB-HAB-001
Version: 1.0 — Initial Release
Companions: ACB-PAB-001 and ACB-STD-001
Administered by: Credentialed ACB assessors
Typical duration: 4–6 hours per entity across 2–3 sessions
Sensitivity: Restricted assessor-use instrument
Methods: Interviews · Observation · Document Review · Community Testimony

Core methodological principle

The pressure-test principle

Every dimension is assessed under adversarial conditions. For each subdimension, assessors look for at least one documented case where compassionate behavior was costly, legally risky, or institutionally inconvenient. If no such case exists, the maximum subdimension score is capped at Developing, even when the entity appears strong under favorable conditions.

In plain terms: high performance when it is easy is not treated as sufficient evidence of compassionate institutional character.
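As a minimal sketch, the cap amounts to a single guard in scoring logic. The function and parameter names below are illustrative, not part of the published instrument:

```python
# Illustrative sketch of the pressure-test cap; names are hypothetical,
# not drawn from published ACB tooling.
DEVELOPING = 3  # rubric anchor 3: "Developing"

def apply_pressure_test_cap(raw_score: int, has_documented_pressure_case: bool) -> int:
    """Cap a subdimension at Developing when no documented case shows
    compassionate behavior that was costly, risky, or inconvenient."""
    if not has_documented_pressure_case:
        return min(raw_score, DEVELOPING)
    return raw_score
```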

Framework overview

The benchmark preserves the same conceptual structure across sectors. What changes by entity type is the evidence model and assessment protocol, not the underlying definition of compassion.

Awareness

Whether the entity reliably detects suffering before it has to be formally named.

Empathy

Whether the entity genuinely models and honors the inner experience of affected people.

Action

Whether compassionate understanding becomes timely, effective, adequately resourced help.

Equity

Whether care is extended fairly, accessibly, and in proportion to need.

Boundaries

Whether help is ethical, sustainable, consent-based, and autonomy-preserving.

Accountability

Whether the entity acknowledges harm, corrects course, learns, and repairs.

Systemic Thinking

Whether the entity addresses root causes, second-order effects, and structural conditions.

Integrity

Whether compassion is genuine, consistent under pressure, and resilient over time.

Assessor orientation

  • Warmth and rigor are not opposites. Assessors document what they observe, not what leadership prefers.
  • Policies on paper are weaker evidence than lived practice, observed outcomes, or affected-community testimony.
  • Skipping the awkward question is one of the fastest ways to produce a weak assessment.
  • If leadership and community testimony conflict, the community account is treated as the primary reference point.

Interview principles

  • Ask for examples, not abstractions.
  • Follow the power gradient and include the least protected voices.
  • Name the gap explicitly when a policy exists but no applied example can be produced.
  • Treat deflection, silence, and refusal to provide evidence as data rather than noise.
  • Score conservatively when evidence is incomplete, then flag for lead assessor review.

7-session human assessment protocol

The Human Assessment Battery uses a structured sequence intended to compare leadership narrative, frontline reality, community experience, and documentary evidence before final score synthesis.

Session | Participants | Primary focus | Typical duration
1A | Senior leadership (2–3 people) | Awareness, Action, Accountability, Integrity | 90 min
1B | HR / People / Ethics leads | Empathy, Boundaries | 60 min
2A | Frontline staff selected by entity | Pressure-test prior leadership claims | 60 min
2B | Frontline staff selected independently by assessor | Repeat and compare against entity-selected group | 60 min
3A | Affected community members recruited independently | Equity and Systemic Thinking, plus lived-experience validation | 90 min
3B | Solo assessor document review | Cross-check claims against records, protocols, data, and artifacts | 60 min
4 | Lead assessor synthesis | Score finalization, discrepancy resolution, escalation flags | 60 min

Continuous research pipeline

After an initial human assessment establishes a baseline, a four-stage nightly pipeline monitors every benchmarked entity for material evidence within a 14-day recency window. Scores change only after human review.

Stage 1

Scanner

Every night, a structured search across all 1,155 benchmarked entities surfaces compassion-relevant evidence from the last 14 days. No entity is skipped. Every entity carries a provenance record of the searches that touched it.

Stage 2

Assessor

Entities with material new evidence receive a full reassessment against the 8-dimension, 40-subdimension rubric. The resulting delta is computed against the published composite.

Stage 3

Digest

A structured digest synthesizes the night’s findings: proposed changes, sector alerts, methodology concerns, and watch items. Nothing is applied yet.

Stage 4

Founder approval

Every proposed score change is reviewed and approved — or rejected — by a human before reaching production. The approval log is auditable. Evidence older than 14 days cannot drive a change.

Each entity page on the published site carries a freshness stamp — Evidence reviewed YYYY-MM-DD — showing either that no material change surfaced in the last 14 days (green) or that new evidence is under review (orange). The scanner touches every one of the 1,155 entities daily, not only the most active ones.
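A minimal sketch of the stamp and recency-window logic; the function names and signatures here are hypothetical, only the 14-day window and the green/orange states come from the methodology:

```python
from datetime import date, timedelta

RECENCY_WINDOW = timedelta(days=14)  # evidence older than this cannot drive a change

def evidence_is_eligible(evidence_date: date, today: date) -> bool:
    """Only evidence inside the 14-day recency window may drive a score change."""
    return today - evidence_date <= RECENCY_WINDOW

def freshness_stamp(last_review: date, under_review: bool) -> str:
    """Render the per-entity stamp: orange while new evidence is under
    review, green when no material change surfaced in the window."""
    color = "orange" if under_review else "green"
    return f"Evidence reviewed {last_review.isoformat()} ({color})"
```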

The weekly briefing is free and editorial. Commercial products are separate — they do not affect scoring.

The weekly briefing on institutional compassion scores

Score changes, sector trends, and emerging risk signals from overnight research across 1,155 entities — every Monday. Free.


Structural safeguard

No automated score changes

Every proposed score change — whether generated by the overnight research pipeline, a new evidence disclosure, or a scheduled rotation — requires explicit human approval before it reaches the published index. The approval log is retained. The proposal and its evidence are retained. The decision is retained.

This gate is not a review of surface numbers. The approver examines the assessment reasoning, the evidence quality, the sector context, and any discrepancy with prior findings. Approximately 30 percent of generated proposals are sent back for additional evidence or adjusted before approval. Rejections are logged alongside approvals.

Independence policy

The commercial separation that protects benchmark integrity.

Entities never pay for inclusion, score changes, or suppression of findings.

  • Inclusion is editorial. Which entities are benchmarked is determined by the editorial team based on institutional significance. It cannot be purchased, sponsored, or influenced by the entity being benchmarked.
  • Scores are evidence-driven. A score reflects the 8-dimension framework applied to available evidence. It is not negotiable. An entity cannot pay to raise, lower, suppress, delay, or condition any score.
  • Commercial services are observation products. Score-Watch, Purchase Research, Data License, and Advisory are observation and interpretation products. Subscribers pay for notification, data access, or guidance — they do not influence what is scored or how.
  • Findings are not embargoed for subscribers. Every confirmed score change is published to /updates and the public index pages at the same time all subscribers receive their alerts. Paying subscribers do not receive scoring information ahead of the public record.
  • Conflicts are declared. If an editorial or advisory relationship exists with a benchmarked entity, it is disclosed on that entity’s page.

This separation is the load-bearing trust commitment of the benchmark. If it ever appears compromised, the benchmark loses its value regardless of any other quality signal.

Evidence hierarchy

The benchmark deliberately differentiates evidence by independence and reliability. Strong scores require stronger evidence.

Tier 1

Independent external audit

Highest-weight evidence such as third-party assessments, regulatory findings, and academic studies.

Tier 2

Verifiable outcome data

High-weight evidence including disaggregated service data, longitudinal surveys, and resolution rates.

Tier 3

Community testimony

High-weight evidence from affected populations, independent focus groups, and structured interviews.

Tier 4

Policy and process documents

Moderate-weight evidence such as governing documents, training records, and budget allocations.

Tier 5

Entity self-report

Lowest-weight evidence including mission statements and annual reports, requiring corroboration from stronger tiers.
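A sketch of the hierarchy as a data structure. The numeric weights are assumptions made for illustration, since the methodology describes tier weights only qualitatively (highest, high, moderate, lowest):

```python
from enum import IntEnum

class EvidenceTier(IntEnum):
    """Five-tier hierarchy; a lower tier number means higher weight."""
    INDEPENDENT_AUDIT = 1    # third-party assessments, regulatory findings
    OUTCOME_DATA = 2         # disaggregated service data, longitudinal surveys
    COMMUNITY_TESTIMONY = 3  # affected populations, structured interviews
    POLICY_DOCUMENTS = 4     # governing documents, training records, budgets
    SELF_REPORT = 5          # mission statements, annual reports

# Hypothetical numeric weights; these values are assumptions, not
# published methodology constants.
TIER_WEIGHT = {
    EvidenceTier.INDEPENDENT_AUDIT: 1.0,
    EvidenceTier.OUTCOME_DATA: 0.8,
    EvidenceTier.COMMUNITY_TESTIMONY: 0.8,
    EvidenceTier.POLICY_DOCUMENTS: 0.5,
    EvidenceTier.SELF_REPORT: 0.2,
}

def requires_corroboration(tier: EvidenceTier) -> bool:
    """Self-report cannot stand alone; it needs a stronger tier behind it."""
    return tier is EvidenceTier.SELF_REPORT
```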

Interpretation rule

Evidence beats aspiration

Where paper claims and lived experience diverge, the methodology scores the world as encountered, not the story as presented.

Common scoring model

Each subdimension is scored on a 0–5 anchored behavioral scale. The five subdimensions within a dimension are summed and converted into a dimension score out of 10. The eight dimension scores together create a base total out of 80, which is then combined with an integration premium worth up to 10 additional points to produce a 0–100 composite.

A score of 0 represents active documented harm and requires lead assessor co-sign. A score of 4 or 5 is provisional unless there is pressure-test evidence.

Integration premium logic

The premium rewards consistent compassionate performance across dimensions rather than isolated strengths. Harm override sets the premium to zero when any subdimension scores 0. The premium is reduced when dimension scores are uneven and weakened further for each dimension that falls below 4.0.

  • Std. dev. ≤ 1.5 → 100% consistency factor
  • Std. dev. above 1.5 up to 3.0 → 75%
  • Std. dev. above 3.0 up to 5.0 → 40%
  • Std. dev. > 5.0 → 10%
  • Weakness penalty: minus 20% for each dimension below 4.0
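Taken together with the common scoring model above, the arithmetic can be sketched as follows. This is an illustrative reading, not the published implementation: the choice of population standard deviation and the multiplicative combination of the consistency factor and weakness penalty are assumptions.

```python
from statistics import pstdev

def dimension_score(subdims: list[int]) -> float:
    """Five 0-5 subdimension scores sum to 0-25, rescaled to 0-10."""
    assert len(subdims) == 5 and all(0 <= s <= 5 for s in subdims)
    return sum(subdims) * 10 / 25

def consistency_factor(dims: list[float]) -> float:
    """Map the spread of the eight dimension scores to a consistency factor.
    Population standard deviation is an assumption; the methodology does
    not say which estimator it uses."""
    sd = pstdev(dims)
    if sd <= 1.5:
        return 1.00
    if sd <= 3.0:
        return 0.75
    if sd <= 5.0:
        return 0.40
    return 0.10

def integration_premium(dims: list[float], any_zero_subdim: bool) -> float:
    """Up to +10 points. Zeroed by the harm override; otherwise reduced by
    the consistency factor and by 20% per dimension below 4.0. Combining
    the two reductions multiplicatively is an illustrative reading."""
    if any_zero_subdim:  # harm override
        return 0.0
    weak = sum(1 for d in dims if d < 4.0)
    return 10.0 * consistency_factor(dims) * max(0.0, 1.0 - 0.20 * weak)

def composite(all_subdims: list[list[int]]) -> float:
    """all_subdims: eight lists of five 0-5 scores, one list per dimension."""
    dims = [dimension_score(s) for s in all_subdims]
    any_zero = any(0 in s for s in all_subdims)
    return sum(dims) + integration_premium(dims, any_zero)  # base /80 + premium /10
```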

Rubric anchors and score bands

The Human Assessment Battery uses universal behavioral anchors at the subdimension level and a five-band public interpretation model at the composite level.

Subdimension anchors (0–5):

Score | Anchor | Meaning
0 | Active Harm | Specific documented harm in the domain; lead assessor co-sign required.
1 | Absent | No meaningful capacity exists.
2 | Minimal | Nominal capacity exists but fails under pressure and does not produce consistent real-world outcomes.
3 | Developing | Good-faith capacity exists in some cases, but not consistently or comprehensively.
4 | Established | Consistent operational capacity across most cases; community confirms positive experience.
5 | Exemplary | Outstanding independently verified performance sustained under significant pressure.

Composite score bands (0–100):

Band | Range | Meaning
critical | 0–20 | Active harm or fundamental compassionate failure.
developing | 21–40 | Nominal capacity in some areas, but major gaps remain.
functional | 41–60 | Basic compassionate capacity exists, though significant weakness remains.
established | 61–80 | Consistent, pressure-tested, independently supported performance.
exemplary | 81–100 | Sector-leading performance with no weak dimensions and strong evidence.
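As a small illustration, the composite-to-band mapping is a straightforward lookup; how fractional composites fall at the boundaries is an assumption here:

```python
def score_band(composite: float) -> str:
    """Map a 0-100 composite to its public interpretation band."""
    if composite <= 20:
        return "critical"
    if composite <= 40:
        return "developing"
    if composite <= 60:
        return "functional"
    if composite <= 80:
        return "established"
    return "exemplary"
```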

Lead assessor review flags

Certain patterns automatically trigger deeper review before a score is finalized.

Active harm

Any subdimension scored 0 requires written documentation and lead assessor co-sign.

Rater discrepancy

An inter-rater reliability (IRR) discrepancy greater than 1.5 on any subdimension triggers review.

Unsupported high scores

A score of 4 or 5 without pressure-test evidence is flagged provisional.

Leadership-community gap

Significant differences between leadership narrative and community testimony must be resolved.

Missing documents

Refusal to provide requested documentation is itself a score-relevant event.

Open discussion flags

Any unresolved discussion note blocks finalization until reviewed.
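A hedged sketch of how these triggers might be collected in assessment tooling; all names and data shapes here are hypothetical, only the trigger conditions come from the methodology:

```python
def review_flags(
    scores: dict[str, int],            # subdimension ID -> agreed score
    rater_gap: dict[str, float],       # subdimension ID -> max inter-rater gap
    pressure_tested: dict[str, bool],  # subdimension ID -> pressure-test evidence exists
    leadership_community_gap: bool,
    documents_refused: bool,
    open_discussion_notes: int,
) -> list[str]:
    """Collect every condition that blocks finalization pending lead review."""
    flags: list[str] = []
    for sid, score in scores.items():
        if score == 0:
            flags.append(f"{sid}: active harm, lead assessor co-sign required")
        if rater_gap.get(sid, 0.0) > 1.5:
            flags.append(f"{sid}: inter-rater discrepancy above 1.5")
        if score >= 4 and not pressure_tested.get(sid, False):
            flags.append(f"{sid}: high score without pressure-test evidence (provisional)")
    if leadership_community_gap:
        flags.append("leadership and community accounts diverge; must be resolved")
    if documents_refused:
        flags.append("refused documentation is itself score-relevant")
    if open_discussion_notes > 0:
        flags.append("open discussion notes block finalization")
    return flags
```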

Full 40-subdimension framework

Each dimension contains five scored subdimensions. Together they define the operational content of the standard.

Dimension | ID | Subdimension | Core assessment question
Awareness | A1 | Suffering Detection | Does this entity reliably detect when others are in pain or need before they explicitly name it?
Awareness | A2 | Contextual Sensitivity | Does awareness adjust to the actual populations being served, rather than to default assumptions?
Awareness | A3 | Blind Spot Mitigation | Does the entity actively seek out the suffering it is currently missing?
Awareness | A4 | Signal Amplification | Does it make visible the suffering of those who cannot easily speak for themselves?
Awareness | A5 | Anticipatory Awareness | Can the entity foresee potential harms before they manifest?
Empathy | E1 | Affective Resonance | Do people feel genuinely cared about rather than merely processed?
Empathy | E2 | Perspective-Taking | Does the entity model the inner experience of those it serves, especially those far from leadership power?
Empathy | E3 | Non-Judgment | Does it suspend judgment across identity, behavior, and belief differences under pressure?
Empathy | E4 | Validation | Does it affirm the legitimacy of others’ experiences, even when inconvenient?
Empathy | E5 | Cultural Empathy | Does it integrate diverse cultural ways of knowing into practice rather than offering surface accommodation?
Action | AC1 | Responsiveness | Do identified needs receive timely, appropriately prioritized response?
Action | AC2 | Proportionality | Is help calibrated to actual need, not simply to what is easiest to provide?
Action | AC3 | Efficacy | Does the help actually reduce suffering rather than just creating activity that looks like help?
Action | AC4 | Resource Mobilization | Does the entity bring adequate resources to the problems it has identified?
Action | AC5 | Follow-Through | Does the entity persist rather than disengage when attention moves on?
Equity | EQ1 | Universality | Does care extend to all people regardless of identity?
Equity | EQ2 | Priority for the Vulnerable | Are those with the greatest need actually prioritized?
Equity | EQ3 | Bias Awareness | Does the entity identify and correct bias in who receives care and how?
Equity | EQ4 | Access Design | Are services designed to be accessible to those who need them most?
Equity | EQ5 | Historical Harm Acknowledgment | Does the entity recognize and respond to historical harms associated with itself or its predecessors?
Boundaries | B1 | Self-Sustainability | Does compassionate work come from a stable, non-depleting foundation?
Boundaries | B2 | Autonomy Preservation | Does help build self-determination rather than dependency?
Boundaries | B3 | Scope Clarity | Does the entity communicate honestly about what it can and cannot do?
Boundaries | B4 | Refusal Ethics | When the entity declines to help, is refusal delivered with dignity and alternatives?
Boundaries | B5 | Consent Orientation | Does it obtain genuine informed consent before acting?
Accountability | AB1 | Harm Acknowledgment | Does the entity acknowledge harm it has caused without deflection?
Accountability | AB2 | Correction Willingness | Does it change course when shown it is causing harm?
Accountability | AB3 | Transparency | Does it operate with genuine transparency about performance and failures?
Accountability | AB4 | Systemic Learning | Does it institutionally learn from failures?
Accountability | AB5 | Reparative Action | Does it make concrete repair to those it has harmed?
Systemic Thinking | S1 | Root Cause Orientation | Does the entity address causes of suffering, not only symptoms?
Systemic Thinking | S2 | Long-Term Impact | Does it plan for long-horizon effects of its actions?
Systemic Thinking | S3 | Interconnection Awareness | Does it understand effects on adjacent systems and second-order consequences?
Systemic Thinking | S4 | Structural Critique | Does it critically examine structures that perpetuate suffering, including those from which it benefits?
Systemic Thinking | S5 | Coalitional Compassion | Does it collaborate in ways that amplify impact beyond its own institutional capacity?
Integrity | I1 | Consistency Under Pressure | Does compassionate behavior hold when it is costly?
Integrity | I2 | Non-Performance | Is compassion genuine rather than reputationally driven?
Integrity | I3 | Internal Consistency | Does the entity treat internal stakeholders with the same compassion it claims externally?
Integrity | I4 | Values Alignment | Are major decisions actually aligned with stated values?
Integrity | I5 | Resilience of Care | Does compassion persist across leadership change and institutional stress?

What assessors are looking for in practice

The Human Assessment Battery turns abstract values into observable behaviors, evidence requests, and comparison points across populations and power levels.

Awareness examples

Soft-signal reporting, proactive outreach, silent-population detection, pre-launch harm assessment.

Empathy examples

Community testimony, direct-service observation, validation before procedure, leadership veto power for affected groups.

Action examples

Response-time data, proportional help, independent outcome studies, follow-through protocols.

Equity examples

Coverage gaps, disaggregated outcomes, bias audits, barrier-removal evidence, historical harm response.

Boundaries examples

Burnout prevention, autonomy measurement, scope clarity, dignified refusal, informed consent withdrawal.

Accountability examples

Public acknowledgment of harm, change after failure, transparency about poor outcomes, co-designed repair.

Systemic examples

Root-cause work, long-range planning, adjacent-system analysis, structural critique, coalition-building.

Integrity examples

Costly moral choices, invisible compassionate practices, staff treatment, values-based decisions, continuity through stress.

Cross-sector adaptation

The same framework can be adapted across governments, corporations, NGOs, religious institutions, AI labs, technology systems, products, and teams. The human battery is especially important when community interviews, leadership interviews, and observation are necessary to understand whether compassionate behavior actually exists in practice.

In the broader ACB architecture, AI systems may also be evaluated with the AI Prompt Battery while organizations behind those systems are evaluated using the Human Assessment Battery.

Methodology intent

The benchmark is designed to be interpretable, reproducible, and contestable. It is meant to reward genuine compassionate capacity, expose performative signaling, and create a shared language for institutional behavior that can be compared over time and across sectors.

The final submission test is simple: could the assessor defend the score in front of both leadership and the affected community in the same room?

Methodology version and change log

Methodology changes are versioned, dated, and publicly described. Historical changes do not retroactively rewrite prior assessments.

Version 1.1 — 2026-04-20
  • Integration premium capped at +10 points (down from +20). Rationale: entities with uniform-high dimension profiles were computing to perfect 100 regardless of evidence quality. The cap ensures the premium rewards consistency without overriding evidence ceilings.
  • Composite score determinism enforced. Every composite must now compute deterministically from its dimension scores via the published formula. A data-layer validator rejects drift above 2.0 points.
  • Floor-clamping artifacts corrected. Entities previously displayed at composite 0.0 (a legacy display-layer artifact) now show their formula-computed composites — typically 4 to 7 points reflecting the actual dimension scores.
Version 1.0 — Initial release
  • 8-dimension, 40-subdimension framework.
  • 7-session human assessment protocol.
  • 5-tier evidence hierarchy.
  • Integration premium up to +20 (superseded by v1.1).
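For illustration, the v1.1 determinism check described above might look like the following sketch. The function shape is an assumption; only the 2.0-point tolerance comes from the change log:

```python
DRIFT_TOLERANCE = 2.0  # v1.1 validator rejects drift above 2.0 points

def composite_is_deterministic(published: float,
                               dims: list[float],
                               premium: float) -> bool:
    """Recompute the composite from its dimension scores and premium via
    the published formula and compare against the stored value."""
    recomputed = sum(dims) + premium
    return abs(published - recomputed) <= DRIFT_TOLERANCE
```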

Explore the benchmark

See how the methodology is applied across the current published index families.

Indexes

Apply it to your organization

Use the framework as an entry point into guided review, advisory, or formal structured assessment.

Assess Your Organization

Use the benchmark seriously

Purchase reports, license data, book advisory work, or discuss a broader institutional relationship.