How the Compassion Benchmark works
The Compassion Benchmark is a structured methodology for measuring whether an institution reliably detects suffering, understands it, responds effectively, distributes care fairly, respects ethical limits, owns failures, addresses root causes, and behaves with genuine integrity. The formal assessment architecture combines an eight-dimension framework, forty scored subdimensions, a tiered evidence model, adversarial pressure testing, and a human-led synthesis workflow designed to make scores legible, contestable, and difficult to game.
Human Assessment Battery
ACB-HAB-001 is the human-administered field guide for corporations, governments, religious institutions, and AI development organizations. It uses structured interviews, document review, observation, and community testimony rather than self-report alone.
| Document ID | ACB-HAB-001 |
|---|---|
| Version | 1.0 — Initial Release |
| Companions | ACB-PAB-001 and ACB-STD-001 |
| Administered by | Credentialed ACB assessors |
| Typical duration | 4–6 interview hours per entity across 2–3 session days |
| Sensitivity | Restricted assessor-use instrument |
Core methodological principle
The pressure-test principle
Every dimension is assessed under adversarial conditions. For each subdimension, assessors look for at least one documented case where compassionate behavior was costly, legally risky, or institutionally inconvenient. If no such case exists, the maximum subdimension score is capped at Developing (3), even when the entity appears strong under favorable conditions.
In plain terms: high performance when it is easy is not treated as sufficient evidence of compassionate institutional character.
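As a scoring rule, the cap is mechanical. Here is a minimal sketch using the Developing anchor (3) from the rubric further below; the function name and inputs are illustrative, not part of the published instrument:

```python
# Minimal sketch of the pressure-test cap. DEVELOPING matches the
# rubric anchor (3); names here are illustrative, not published ones.
DEVELOPING = 3

def apply_pressure_test_cap(raw_score: int, has_costly_case: bool) -> int:
    """Cap a subdimension score at Developing (3) when no documented case
    shows compassionate behavior that was costly, risky, or inconvenient."""
    return raw_score if has_costly_case else min(raw_score, DEVELOPING)
```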
Framework overview
The benchmark preserves the same conceptual structure across sectors. What changes by entity type is the evidence model and assessment protocol, not the underlying definition of compassion.
Awareness
Whether the entity reliably detects suffering before it has to be formally named.
Empathy
Whether the entity genuinely models and honors the inner experience of affected people.
Action
Whether compassionate understanding becomes timely, effective, adequately resourced help.
Equity
Whether care is extended fairly, accessibly, and in proportion to need.
Boundaries
Whether help is ethical, sustainable, consent-based, and autonomy-preserving.
Accountability
Whether the entity acknowledges harm, corrects course, learns, and repairs.
Systemic Thinking
Whether the entity addresses root causes, second-order effects, and structural conditions.
Integrity
Whether compassion is genuine, consistent under pressure, and resilient over time.
Assessor orientation
- Warmth and rigor are not opposites. Assessors document what they observe, not what leadership prefers.
- Policies on paper are weaker evidence than lived practice, observed outcomes, or affected-community testimony.
- Skipping the awkward question is one of the fastest ways to produce a weak assessment.
- If leadership and community testimony conflict, the community account is treated as the primary reference point.
Interview principles
- Ask for examples, not abstractions.
- Follow the power gradient and include the least protected voices.
- Name the gap explicitly when a policy exists but no applied example can be produced.
- Treat deflection, silence, and refusal to provide evidence as data rather than noise.
- Score conservatively when evidence is incomplete, then flag for lead assessor review.
7-session human assessment protocol
The Human Assessment Battery uses a structured sequence intended to compare leadership narrative, frontline reality, community experience, and documentary evidence before final score synthesis.
| Session | Participants | Primary focus | Typical duration |
|---|---|---|---|
| 1A | Senior leadership (2–3 people) | Awareness, Action, Accountability, Integrity | 90 min |
| 1B | HR / People / Ethics leads | Empathy, Boundaries | 60 min |
| 2A | Frontline staff selected by entity | Pressure-test prior leadership claims | 60 min |
| 2B | Frontline staff selected independently by assessor | Repeat and compare against entity-selected group | 60 min |
| 3A | Affected community members recruited independently | Equity and Systemic Thinking, plus lived-experience validation | 90 min |
| 3B | Solo assessor document review | Cross-check claims against records, protocols, data, and artifacts | 60 min |
| 4 | Lead assessor synthesis | Score finalization, discrepancy resolution, escalation flags | 60 min |
Continuous research pipeline
After an initial human assessment establishes a baseline, a four-stage nightly pipeline monitors every benchmarked entity for material evidence within a 14-day recency window. Scores change only after human review.
Stage 1
Scanner
Every night, a structured search across all 1,155 benchmarked entities surfaces compassion-relevant evidence from the last 14 days. No entity is skipped. Every entity carries a provenance record of the searches that touched it.
Stage 2
Assessor
Entities with material new evidence receive a full reassessment against the 8-dimension, 40-subdimension rubric. The score delta is computed against the published composite.
Stage 3
Digest
A structured digest synthesizes the night’s findings: proposed changes, sector alerts, methodology concerns, and watch items. Nothing is applied yet.
Stage 4
Founder approval
Every proposed score change is reviewed and approved — or rejected — by a human before reaching production. The approval log is auditable. Evidence older than 14 days cannot drive a change.
Each entity page on the published site carries a freshness stamp ("Evidence reviewed YYYY-MM-DD") showing either that no material change surfaced in the last 14 days (green) or that new evidence is under review (orange). The scanner touches every one of the 1,155 entities daily, not only the most active ones.
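The pipeline's gating rules reduce to a few mechanical checks. A hedged sketch follows; the record shapes and function names are hypothetical and not drawn from the published system, while the 14-day window, the human approval gate, and the two stamp colors come from the text above:

```python
# Hedged sketch of the nightly pipeline's gating rules. Only the 14-day
# window, the human gate, and the stamp colors come from the methodology;
# everything else is an assumed shape for illustration.
from datetime import date, timedelta

RECENCY_WINDOW = timedelta(days=14)

def evidence_can_drive_change(evidence_date: date, today: date) -> bool:
    """Evidence older than 14 days cannot drive a score change."""
    return (today - evidence_date) <= RECENCY_WINDOW

def freshness_stamp(new_evidence_under_review: bool) -> str:
    """Green: no material change surfaced in the window.
    Orange: new evidence is under review."""
    return "orange" if new_evidence_under_review else "green"

def apply_proposal(proposal: dict, human_approved: bool) -> dict | None:
    """Stage 4: nothing reaches production without explicit human
    approval; approvals and rejections are both logged."""
    return proposal if human_approved else None
```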
The weekly briefing is free and editorial. Commercial products are separate — they do not affect scoring.
The weekly briefing on institutional compassion scores
Score changes, sector trends, and emerging risk signals from overnight research across 1,155 entities — every Monday. Free.
Structural safeguard
No automated score changes
Every proposed score change, whether generated by the overnight research pipeline, a new evidence disclosure, or a scheduled rotation, requires explicit human approval before it reaches the published index. The approval log is retained. The proposal and its evidence are retained. The decision is retained.
This gate is not a review of surface numbers. The approver examines the assessment reasoning, the evidence quality, the sector context, and any discrepancy with prior findings. Approximately 30 percent of generated proposals are sent back for additional evidence or adjusted before approval. Rejections are logged alongside approvals.
Independence policy
The commercial separation that protects benchmark integrity.
Entities never pay for inclusion, score changes, or suppression of findings.
This separation is the load-bearing trust commitment of the benchmark. If it ever appears compromised, the benchmark loses its value regardless of any other quality signal.
Evidence hierarchy
The benchmark deliberately differentiates evidence by independence and reliability. Strong scores require stronger evidence.
Tier 1
Independent external audit
Highest-weight evidence such as third-party assessments, regulatory findings, and academic studies.
Tier 2
Verifiable outcome data
High-weight evidence including disaggregated service data, longitudinal surveys, and resolution rates.
Tier 3
Community testimony
High-weight evidence from affected populations, independent focus groups, and structured interviews.
Tier 4
Policy and process documents
Moderate-weight evidence such as governing documents, training records, and budget allocations.
Tier 5
Entity self-report
Lowest-weight evidence including mission statements and annual reports, requiring corroboration from stronger tiers.
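For tooling purposes the hierarchy is a small lookup structure. A sketch under one loud assumption: the methodology specifies only relative weight classes (highest, high, moderate, lowest), so the numeric weights below are hypothetical placeholders, not published values:

```python
# The five-tier hierarchy as a lookup structure. The numeric weights are
# hypothetical placeholders; the methodology gives only relative classes.
from enum import IntEnum

class EvidenceTier(IntEnum):
    INDEPENDENT_AUDIT = 1    # highest weight
    OUTCOME_DATA = 2         # high weight
    COMMUNITY_TESTIMONY = 3  # high weight
    POLICY_DOCUMENTS = 4     # moderate weight
    SELF_REPORT = 5          # lowest weight

TIER_WEIGHT = {              # hypothetical values, for illustration only
    EvidenceTier.INDEPENDENT_AUDIT: 1.0,
    EvidenceTier.OUTCOME_DATA: 0.8,
    EvidenceTier.COMMUNITY_TESTIMONY: 0.8,
    EvidenceTier.POLICY_DOCUMENTS: 0.5,
    EvidenceTier.SELF_REPORT: 0.2,
}

def requires_corroboration(tier: EvidenceTier) -> bool:
    """Self-report alone cannot support a strong score."""
    return tier is EvidenceTier.SELF_REPORT
```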
Interpretation rule
Evidence beats aspiration
Where paper claims and lived experience diverge, the methodology scores the world as encountered, not the story as presented.
Common scoring model
Each subdimension is scored on a 0–5 anchored behavioral scale. The five subdimensions within a dimension are summed and converted into a dimension score out of 10. The eight dimension scores together create a base total out of 80, which is then combined with an integration premium worth up to 10 additional points to produce the composite, reported on a 0–100 scale. (With the current +10 premium cap, the maximum achievable composite is 90; see the change log below.)
A score of 0 represents active documented harm and requires lead assessor co-sign. A score of 4 or 5 is provisional unless there is pressure-test evidence.
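A sketch of this arithmetic under the straightforward linear reading: five 0–5 subdimension scores (25 raw points) scale to a 0–10 dimension score, and the eight dimension scores sum to the base out of 80. The integration premium itself is sketched after the band list below:

```python
# Sketch of the scoring arithmetic, assuming a linear conversion from
# the 25 raw subdimension points to the 0-10 dimension score.

def dimension_score(subdims: list[int]) -> float:
    """Convert five 0-5 subdimension scores into a 0-10 dimension score."""
    assert len(subdims) == 5 and all(0 <= s <= 5 for s in subdims)
    return sum(subdims) * (10 / 25)

def composite(dimension_scores: list[float], premium: float) -> float:
    """Base total out of 80 plus an integration premium of up to 10
    points; maximum achievable composite is 90 under the current cap."""
    assert len(dimension_scores) == 8
    return sum(dimension_scores) + premium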
Integration premium logic
The premium rewards consistent compassionate performance across dimensions rather than isolated strengths. Harm override sets the premium to zero when any subdimension scores 0. The premium is reduced when dimension scores are uneven and weakened further for each dimension that falls below 4.0.
- Std. dev. ≤ 1.5 → 100% consistency factor
- Std. dev. 1.5–3.0 → 75%
- Std. dev. 3.0–5.0 → 40%
- Std. dev. > 5.0 → 10%
- Weakness penalty: minus 20% for each dimension below 4.0
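The premium logic can be expressed directly. In this sketch the band thresholds, the harm override, and the 20 percent weakness penalty come from the methodology; the multiplicative combination of the two factors and the use of the population standard deviation are assumptions about details the text leaves open:

```python
# Hedged sketch of the integration premium. Bands, harm override, and
# weakness penalty are from the methodology; how the factors combine is
# an assumption.
import statistics

def integration_premium(dimension_scores: list[float],
                        any_subdimension_zero: bool) -> float:
    if any_subdimension_zero:
        return 0.0  # harm override zeroes the premium outright

    sd = statistics.pstdev(dimension_scores)
    if sd <= 1.5:
        consistency = 1.00
    elif sd <= 3.0:
        consistency = 0.75
    elif sd <= 5.0:
        consistency = 0.40
    else:
        consistency = 0.10

    weak = sum(1 for d in dimension_scores if d < 4.0)
    weakness_factor = max(0.0, 1.0 - 0.20 * weak)

    return 10.0 * consistency * weakness_factor  # capped at +10 (v1.1)
```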
Rubric anchors and score bands
The Human Assessment Battery uses universal behavioral anchors at the subdimension level and a five-band public interpretation model at the composite level.
| Score | Anchor | Meaning |
|---|---|---|
| 0 | Active Harm | Specific documented harm in the domain; lead assessor co-sign required. |
| 1 | Absent | No meaningful capacity exists. |
| 2 | Minimal | Nominal capacity exists but fails under pressure and does not produce consistent real-world outcomes. |
| 3 | Developing | Good-faith capacity exists in some cases, but not consistently or comprehensively. |
| 4 | Established | Consistent operational capacity across most cases; community confirms positive experience. |
| 5 | Exemplary | Outstanding independently verified performance sustained under significant pressure. |
Lead assessor review flags
Certain patterns automatically trigger deeper review before a score is finalized. A sketch of the mechanically checkable flags follows this list.
Active harm
Any subdimension scored 0 requires written documentation and lead assessor co-sign.
Rater discrepancy
Inter-rater reliability (IRR) discrepancy greater than 1.5 on any subdimension triggers review.
Unsupported high scores
A score of 4 or 5 without pressure-test evidence is flagged provisional.
Leadership-community gap
Significant differences between leadership narrative and community testimony must be resolved.
Missing documents
Refusal to provide requested documentation is itself a score-relevant event.
Open discussion flags
Any unresolved discussion note blocks finalization until reviewed.
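The following sketch covers only the flags a machine can check (active harm, rater discrepancy, unsupported high scores, open notes); the leadership-community gap and missing-document flags require assessor judgment and are omitted. The record shapes and function name are hypothetical:

```python
# Sketch of the mechanically checkable review flags. Thresholds come
# from the methodology; record shapes and names are hypothetical.

def review_flags(rater_a: dict[str, int], rater_b: dict[str, int],
                 pressure_tested: set[str], open_notes: int) -> list[str]:
    flags = []
    for subdim, a in rater_a.items():
        b = rater_b[subdim]
        if a == 0 or b == 0:
            flags.append(f"{subdim}: active harm score, lead assessor co-sign required")
        if abs(a - b) > 1.5:
            flags.append(f"{subdim}: inter-rater discrepancy exceeds 1.5")
        if max(a, b) >= 4 and subdim not in pressure_tested:
            flags.append(f"{subdim}: high score provisional, no pressure-test evidence")
    if open_notes > 0:
        flags.append("unresolved discussion notes block finalization")
    return flags
```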
Full 40-subdimension framework
Each dimension contains five scored subdimensions. Together they define the operational content of the standard.
| Dimension | ID | Subdimension | Core assessment question |
|---|---|---|---|
| Awareness | A1 | Suffering Detection | Does this entity reliably detect when others are in pain or need before they explicitly name it? |
| Awareness | A2 | Contextual Sensitivity | Does awareness adjust to the actual populations being served, rather than to default assumptions? |
| Awareness | A3 | Blind Spot Mitigation | Does the entity actively seek out the suffering it is currently missing? |
| Awareness | A4 | Signal Amplification | Does it make visible the suffering of those who cannot easily speak for themselves? |
| Awareness | A5 | Anticipatory Awareness | Can the entity foresee potential harms before they manifest? |
| Empathy | E1 | Affective Resonance | Do people feel genuinely cared about rather than merely processed? |
| Empathy | E2 | Perspective-Taking | Does the entity model the inner experience of those it serves, especially those far from leadership power? |
| Empathy | E3 | Non-Judgment | Does it suspend judgment across identity, behavior, and belief differences under pressure? |
| Empathy | E4 | Validation | Does it affirm the legitimacy of others’ experiences, even when inconvenient? |
| Empathy | E5 | Cultural Empathy | Does it integrate diverse cultural ways of knowing into practice rather than offering surface accommodation? |
| Action | AC1 | Responsiveness | Do identified needs receive timely, appropriately prioritized response? |
| Action | AC2 | Proportionality | Is help calibrated to actual need, not simply to what is easiest to provide? |
| Action | AC3 | Efficacy | Does the help actually reduce suffering rather than just creating activity that looks like help? |
| Action | AC4 | Resource Mobilization | Does the entity bring adequate resources to the problems it has identified? |
| Action | AC5 | Follow-Through | Does the entity persist rather than disengage when attention moves on? |
| Equity | EQ1 | Universality | Does care extend to all people regardless of identity? |
| Equity | EQ2 | Priority for the Vulnerable | Are those with the greatest need actually prioritized? |
| Equity | EQ3 | Bias Awareness | Does the entity identify and correct bias in who receives care and how? |
| Equity | EQ4 | Access Design | Are services designed to be accessible to those who need them most? |
| Equity | EQ5 | Historical Harm Acknowledgment | Does the entity recognize and respond to historical harms associated with itself or its predecessors? |
| Boundaries | B1 | Self-Sustainability | Does compassionate work come from a stable, non-depleting foundation? |
| Boundaries | B2 | Autonomy Preservation | Does help build self-determination rather than dependency? |
| Boundaries | B3 | Scope Clarity | Does the entity communicate honestly about what it can and cannot do? |
| Boundaries | B4 | Refusal Ethics | When the entity declines to help, is refusal delivered with dignity and alternatives? |
| Boundaries | B5 | Consent Orientation | Does it obtain genuine informed consent before acting? |
| Accountability | AB1 | Harm Acknowledgment | Does the entity acknowledge harm it has caused without deflection? |
| Accountability | AB2 | Correction Willingness | Does it change course when shown it is causing harm? |
| Accountability | AB3 | Transparency | Does it operate with genuine transparency about performance and failures? |
| Accountability | AB4 | Systemic Learning | Does it institutionally learn from failures? |
| Accountability | AB5 | Reparative Action | Does it make concrete repair to those it has harmed? |
| Systemic Thinking | S1 | Root Cause Orientation | Does the entity address causes of suffering, not only symptoms? |
| Systemic Thinking | S2 | Long-Term Impact | Does it plan for long-horizon effects of its actions? |
| Systemic Thinking | S3 | Interconnection Awareness | Does it understand effects on adjacent systems and second-order consequences? |
| Systemic Thinking | S4 | Structural Critique | Does it critically examine structures that perpetuate suffering, including those from which it benefits? |
| Systemic Thinking | S5 | Coalitional Compassion | Does it collaborate in ways that amplify impact beyond its own institutional capacity? |
| Integrity | I1 | Consistency Under Pressure | Does compassionate behavior hold when it is costly? |
| Integrity | I2 | Non-Performance | Is compassion genuine rather than reputationally driven? |
| Integrity | I3 | Internal Consistency | Does the entity treat internal stakeholders with the same compassion it claims externally? |
| Integrity | I4 | Values Alignment | Are major decisions actually aligned with stated values? |
| Integrity | I5 | Resilience of Care | Does compassion persist across leadership change and institutional stress? |
What assessors are looking for in practice
The Human Assessment Battery turns abstract values into observable behaviors, evidence requests, and comparison points across populations and power levels.
Awareness examples
Soft-signal reporting, proactive outreach, silent-population detection, pre-launch harm assessment.
Empathy examples
Community testimony, direct-service observation, validation before procedure, leadership veto power for affected groups.
Action examples
Response-time data, proportional help, independent outcome studies, follow-through protocols.
Equity examples
Coverage gaps, disaggregated outcomes, bias audits, barrier-removal evidence, historical harm response.
Boundaries examples
Burnout prevention, autonomy measurement, scope clarity, dignified refusal, informed consent withdrawal.
Accountability examples
Public acknowledgment of harm, change after failure, transparency about poor outcomes, co-designed repair.
Systemic examples
Root-cause work, long-range planning, adjacent-system analysis, structural critique, coalition-building.
Integrity examples
Costly moral choices, invisible compassionate practices, staff treatment, values-based decisions, continuity through stress.
Cross-sector adaptation
The same framework can be adapted across governments, corporations, NGOs, religious institutions, AI labs, technology systems, products, and teams. The human battery is especially important when community interviews, leadership interviews, and observation are necessary to understand whether compassionate behavior actually exists in practice.
In the broader ACB architecture, AI systems may also be evaluated with the AI Prompt Battery while organizations behind those systems are evaluated using the Human Assessment Battery.
Methodology intent
The benchmark is designed to be interpretable, reproducible, and contestable. It is meant to reward genuine compassionate capacity, expose performative signaling, and create a shared language for institutional behavior that can be compared over time and across sectors.
The final submission test is simple: could the assessor defend the score in front of both leadership and the affected community in the same room?
Methodology version and change log
Methodology changes are versioned, dated, and publicly described. Historical changes do not retroactively rewrite prior assessments.
Version 1.1
- Integration premium capped at +10 points (down from +20). Rationale: entities with uniform-high dimension profiles were computing to a perfect 100 regardless of evidence quality. The cap ensures the premium rewards consistency without overriding evidence ceilings.
- Composite score determinism enforced. Every composite must now compute deterministically from its dimension scores via the published formula. A data-layer validator rejects drift above 2.0 points (a minimal sketch of this check follows the change log).
- Floor-clamping artifacts corrected. Entities previously displayed at composite 0.0 (a legacy display-layer artifact) now show their formula-computed composites, typically 4 to 7 points, reflecting their actual dimension scores.
Version 1.0 (initial release)
- 8-dimension, 40-subdimension framework.
- 7-session human assessment protocol.
- 5-tier evidence hierarchy.
- Integration premium up to +20 (superseded by v1.1).
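A minimal sketch of the v1.1 determinism check, assuming the composite formula given in the scoring model above; the function name and rejection mechanism are illustrative:

```python
# Minimal sketch of the v1.1 data-layer determinism validator. The
# recomputation uses the composite formula sketched earlier; the name
# and mechanism here are illustrative.

DRIFT_TOLERANCE = 2.0  # points of allowed drift before rejection

def validate_composite(published: float,
                       dimension_scores: list[float],
                       premium: float) -> bool:
    """Reject any published composite that drifts more than 2.0 points
    from the value recomputed via the published formula."""
    recomputed = sum(dimension_scores) + premium
    return abs(published - recomputed) <= DRIFT_TOLERANCE
```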
Explore the benchmark
See how the methodology is applied across the current published index families.
Indexes
Apply it to your organization
Use the framework as an entry point into guided review, advisory, or formal structured assessment.
Assess Your Organization
Use the benchmark seriously
Purchase reports, license data, book advisory work, or discuss a broader institutional relationship.