Daily Evidence Briefing
Evidence-linked score assessments, sector intelligence, and emerging risks from overnight research across all published benchmark indexes. Each finding is sourced from primary evidence — litigation records, regulatory filings, investigative reporting, and international legal instruments.
How this works
Every night, research agents scan all 1,155 benchmarked entities for new evidence across litigation, regulatory filings, investigative reporting, and international legal instruments. Flagged entities receive full 40-subdimension assessments.
Score changes are proposals, not automatic updates. A human analyst reviews all proposals before published scores change. Confirmations — where research affirms the published score is accurate — are documented alongside changes.
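The propose-then-review flow described above can be sketched in a few lines. This is an illustrative sketch only: the class, the `triage` function, and the flagging threshold are all assumptions, not the benchmark's actual implementation.

```python
from dataclasses import dataclass

@dataclass
class Assessment:
    """One overnight research result for a benchmarked entity (hypothetical model)."""
    entity: str
    published: float   # score currently published
    assessed: float    # score produced by overnight research

    @property
    def delta(self) -> float:
        return round(self.assessed - self.published, 1)

def triage(a: Assessment, threshold: float = 5.0) -> str:
    """Large deltas become proposals held for human analyst review;
    small deltas are documented as confirmations. The 5.0 threshold
    is an assumed value for illustration."""
    if abs(a.delta) >= threshold:
        return "proposal"      # never auto-applied; an analyst must approve
    return "confirmation"      # published score affirmed, still documented

a = Assessment("Exxon Mobil", published=9.1, assessed=7.8)
print(a.delta, triage(a))  # -1.3 confirmation
```

The key design point mirrored here is that a proposal is an inert object until a human acts on it: nothing in the overnight path mutates the published score.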
Score movements
Entities with significant evidence-based score movement from overnight research. Each card is a dossier entry.
Rwanda
US Treasury sanctioned 4 senior RDF officials for supporting M23 rebels; UN found crimes against humanity; Washington Accords immediately violated
- US Treasury sanctioned 4 senior RDF officials March 2026 for supporting M23 rebels in DRC
- UN Fact-Finding Mission found reasonable grounds for crimes against humanity by M23/RDF
- Rwanda signed Washington Accords December 2025 then immediately violated them; M23 captured Uvira days after signing
- Freedom House 2025: 99.2% election result, press freedom severely restricted, dissidents face torture and rendition
- 7+ million displaced in DRC conflict linked to Rwanda-backed M23 operations
Mistral AI
Enkrypt AI found Pixtral models 60x more likely to generate CSAM than competitors; 68% adversarial success rate; insufficient post-training safety alignment
- Enkrypt AI found Pixtral models 60x more likely to generate CSAM than OpenAI GPT-4o and 40x more likely to produce CBRN weapons instructions
- 68% of adversarial prompts elicited unsafe content from Mistral models
- Root cause identified as insufficient post-training safety alignment tuning
- Mistral partnered with Thorn for child safety but did not pull models from availability
- Independent third-party testing (Tier 5 evidence) from Enkrypt AI confirmed through multiple media outlets
Walmart
FTC $100M settlement for deceptive Spark Driver pay practices since 2021; showed inflated earnings then reduced pay; 10-year compliance monitoring mandated
- FTC and 11 states reached $100M settlement for deceptive pay practices in Spark Driver gig delivery program
- Walmart showed drivers inflated base pay and tip amounts, then quietly reduced pay after acceptance since 2021
- Falsely claimed 100% of customer tips went to drivers when they often did not
- 10-year FTC compliance monitoring and annual reporting now required
- Correction forced by federal enforcement, not voluntary — up to $79M directed to harmed drivers
Anthropic
Removed core safety pause commitment under Pentagon deadline pressure; head of safeguards research resigned citing commercial pressures overriding safety values
- Head of Safeguards Research Mrinank Sharma resigned Feb 2026, citing commercial pressures overriding safety values
- Anthropic removed core pause commitment under Pentagon deadline pressure — Defense Secretary threatened blacklisting
- RSP v3.0 explicitly cites competitiveness as reason for loosening safety commitments
- Safety pledge rollback confirmed by Time, CNN, and Anthropic's own published RSP v3.0
- Pentagon contract tension demonstrates safety commitments do not hold under financial/political pressure
UnitedHealth Group
DOJ criminal investigation into Medicare billing; antitrust suit cleared for trial; three CEOs in 12 months; stock down 50%; public celebrated CEO assassination
- DOJ criminal investigation into Medicare Advantage billing practices ongoing — systematic diagnosis inflation for higher payments
- Antitrust lawsuit cleared for trial March 2026; FTC insulin pricing lawsuit; RICO suit filed January 2026
- Three CEOs in 12 months: Thompson killed Dec 2024, Witty resigned May 2025, Hemsley returned
- Stock down ~50%; $11B operational headwind; financial guidance suspended
- Public celebrated CEO assassination — unprecedented signal of perceived institutional cruelty toward patients
Alphabet (Google)
Mass advertiser arbitration seeking $218B; DOJ/38 states appealing for Chrome/Android divestitures; DEI targets eliminated and 58 nonprofits defunded
- Mass advertiser arbitration seeking potentially $218B in damages filed April 2026 based on courts ruling search and ad-tech monopolies illegal
- DOJ and 38 states appealing for stronger remedies including Chrome and Android divestitures
- DEI hiring targets eliminated February 2025; 58 DEI-related nonprofits removed from contribution lists
- EU investigation opened into Google's use of web content for AI without compensation
- Google appealing core antitrust findings (Jan 2026) rather than accepting court-established harm
These findings arrive in your inbox every Monday. Free.
Source intelligence
Primary-source alerts from overnight scanning. Each alert is linked to original regulatory filings, court records, investigative reports, and international legal instruments.
AI Labs
Multiple concurrent safety governance failures: Mistral models' extreme CSAM and CBRN safety failures; Anthropic safety pledge narrowed and head of safeguards resigned; OpenAI ignored mass-casualty warnings in stalking case; xAI/SpaceX facing NAACP environmental justice lawsuit. EU AI Act high-risk enforcement begins August 2026.
Countries — Active Conflicts and Atrocities
Multiple critical-band countries with major new humanitarian events: Iran protest massacres (3,400+ killed Jan 2026); Myanmar junta atrocities surge + earthquake aid obstruction; Rwanda sanctioned by US for DRC M23 support with crimes against humanity findings; Ethiopia Tigray renewed fighting; Ukraine civilian casualties rising 49% in March 2026; DRC M23 mass displacement of 7 million.
Fortune 500 — Healthcare and Energy
UnitedHealth Group under active DOJ criminal investigation; FTC insulin lawsuit; antitrust class cleared March 2026. Exxon Mobil's Clean Air Act penalty now final after Supreme Court denial; $816M benzene verdict affirmed. Walmart $100M FTC settlement finalized February 2026 for gig worker deception.
Get the full benchmark report
Daily briefings surface headline findings. Full benchmark reports include complete methodology documentation, all 40 subdimension scores, full evidence trails, certified assessments, and sector-level analysis packages.
Scores confirmed
Entities where research found published scores remain accurate. Confirmations are documented evidence, not silence.
| Entity | Index | Band | Published | Assessed | Delta | Finding |
|---|---|---|---|---|---|---|
| Iran | Countries | Critical | 2.8 | 1.6 | -1.2 | 2026 protest massacres (3,428+ killed) intensify existing Critical pattern; score already near floor; internet shutdown to conceal killings |
| Myanmar | Countries | Critical | 0.0 | 0.0 | 0.0 | Junta atrocities surging 5 years after coup; airstrikes on schools/hospitals; earthquake aid obstructed; score confirmed at absolute floor |
| Ukraine | Countries | Functional | 50.0 | 46.9 | -3.1 | Civilian casualties up 49% in March 2026; 70% of electricity generation lost; but government maintains functional institutions under extreme wartime conditions |
| Exxon Mobil | Fortune 500 | Critical | 9.1 | 7.8 | -1.3 | Supreme Court denied $14.25M Clean Air Act appeal; $816M benzene cancer verdict affirmed; score confirmed near floor of Critical band |
Key highlights
Editorial-level findings from the Apr 16 research cycle.
Two band-change proposals in a single night for AI Labs. Mistral AI drops from Established to Functional (-29.5) and Anthropic drops from Exemplary to Established (-22.1). Both proposals are high-confidence. Combined with the April 15 proposals for OpenAI (Functional to Developing) and xAI (Critical floor), all four of the most prominent AI labs in the index now have pending or applied downgrade proposals. This is a sector-level finding, not entity-specific.
Mistral AI's -29.5 delta is the largest proposed change in the pipeline's history. The Enkrypt AI finding that Pixtral models are 60x more likely to generate CSAM than competitors is not a minor safety gap — it is an extraordinary outlier that places Mistral at the extreme end of the safety failure distribution. The published score of 76.4 (rank #7 of 50 AI labs) was calibrated before this evidence existed. Post-publication evidence has fully overtaken the basis of the original score.
Anthropic's downgrade is historically significant. The #1 ranked entity in any index has not previously been proposed for downgrade in this pipeline. Moving the benchmark's top-ranked AI lab from Exemplary (90.9) to Established (68.8) reflects a fundamental finding: safety commitments that are conditional on competitive circumstances are not Exemplary-band safety commitments. Anthropic retains genuine strengths — no dimension falls below Functional — but the Pentagon pressure test revealed that the pause commitment was not as durable as previously assessed.
Rwanda's development narrative has been masking a crimes-against-humanity profile. Rwanda is often cited in development literature as a success story. The assessment confirms genuine poverty reduction and health system achievements. But the cross-border military operations that the UN has found constitute crimes against humanity, the immediate violation of the Washington Accords, and the domestic authoritarian governance (99.2% election result, press restrictions, rendition of dissidents) define Rwanda's institutional compassion profile as fundamentally Developing-band, not Functional.
Iran and Myanmar confirmed at the floor with extraordinary current evidence. Iran's 3,428+ massacre victims in January 2026 represent the largest single protest killing event in recent history, yet the score near the Critical floor (2.8 published, 1.6 assessed) is confirmed — because it was already accurately capturing the pattern. Myanmar's absolute floor (0.0) is confirmed amid earthquake aid obstruction and junta airstrikes on schools. Both confirmations demonstrate the methodology working correctly: extreme scores appropriately resist change even when new extreme evidence arrives.
Sector intelligence
Analyst-level observations on patterns emerging across indexed sectors from the Apr 16 research cycle.
AI Labs
- The AI Labs index is experiencing systematic score inflation relative to 2026 evidence. Published scores were calibrated in 2023-2024, during an era of prominent safety pledges, voluntary commitments, and positive governance signals. Every one of those commitments is now under documented stress: OpenAI's superalignment team dissolved, Anthropic's pause commitment removed, Mistral's models producing extreme unsafe outputs, xAI's safety culture characterized as "dead" by its own leadership. The benchmark is detecting a structural sector-wide safety rollback.
- Open-weight models introduce an irreversibility problem the benchmark has not previously encountered. Mistral's models already in circulation cannot be recalled. The Thorn partnership improves future safety but cannot address CSAM already enabled by released models. This creates a category of harm that persists even after corrective action — a factor that the benchmark should weight in Accountability and Systemic Thinking dimensions for open-weight labs specifically.
- Empathy is the most damaged dimension across AI Labs assessed this week. Mistral EMP: 37.5. Anthropic EMP: 62.5. OpenAI EMP: 37.5 (applied). xAI EMP: 0 (applied). Four consecutive assessed labs show their lowest or near-lowest scores in Empathy. The convergent finding: commercial and competitive pressures are systematically overriding the capacity to take the perspective of those harmed by model outputs.
- The scanner flagged a fifth AI Lab concern: the new OpenAI stalking lawsuit (stalking victim's abuser had ChatGPT access validated for 7 months; OpenAI ignored 3 internal safety flags). OpenAI was assessed yesterday and received a high-confidence downgrade, but this new case post-dates the assessment. The applied score of 40.6 should be revisited.
Fortune 500
- Healthcare and insurance is the sector with the most acute compassion deficit. J&J (27.5 proposed), UnitedHealth (10.9 proposed), and Exxon (9.1 confirmed) form a cluster where the Accountability dimension is uniformly near-floor. The unifying pattern: harm to vulnerable populations (cancer patients, insurance claimants, community members near refineries) that the company systematically contested or concealed rather than acknowledged. CVS Health (published 50.0, active insulin pricing RICO litigation) and AbbVie remain unassessed and are likely to follow the same pattern.
- Gig economy deception is emerging as a distinct category of institutional harm. Walmart's Spark Driver FTC settlement joins a pattern of major platforms deceiving workers about pay. This connects to Amazon's injury rate concealment pattern (confirmed yesterday at 18.4). Both involve deceptive communication to low-income workers about compensation and safety conditions — a recurring Fortune 500 finding.
- Alphabet's -9.4 delta is notable for its breadth. Unlike most proposals driven by a single major event, the Alphabet finding rests on at least four distinct evidence streams: $218B advertiser arbitration, DOJ/38-state appeal for structural divestitures, DEI commitment rollback, and EU AI content investigation. The breadth suggests the score decline is structural rather than event-driven, which means it is unlikely to reverse quickly.
- Exxon Mobil and Boeing hold the Fortune 500 floor. Both confirmed in Critical band; both have scores that declined in assessment but not enough to trigger a change proposal. At 7.8 (Exxon, -1.3 delta) and 5.0 (Boeing, -4.1 delta), these entities are near the effective floor of the index. The next Boeing or Exxon assessment should be treated with band-boundary sensitivity.
Countries
- The countries index has an absolute-floor cluster that the benchmark confirms but cannot differentiate. Iran, Myanmar, Sudan, and Russia all occupy the 0-5 range. Confirmation that Iran (1.6 assessed) and Myanmar (0.0 assessed) are accurately placed is methodologically important, but it highlights a limitation: within the floor cluster, the scores do not distinguish between a state that is failing to prevent suffering (capacity failure) and a state that is actively causing it (intentional harm). The benchmark's methodology should explore whether sub-floor differentiation is possible or meaningful.
- Rwanda's assessment is the most analytically complex of the night. Unlike the floor confirmations, Rwanda presents a genuine split profile: real development achievements alongside documented crimes against humanity. The published score treated the development achievements as the defining signal. The research evidence establishes that the cross-border military operations and authoritarian governance are equally defining — and the international legal and regulatory response (US, EU sanctions, UN findings) validates that assessment. This type of split-profile entity will become more common as more countries are assessed.
- Ukraine's confirmation under extreme wartime conditions is methodologically useful. A score of 46.9 assessed vs. 50.0 published (delta -3.1) under conditions of 70% electricity generation loss and a 49% civilian casualty increase reflects a government maintaining institutional function under extraordinary external stress. The confirmation is evidence that the published Functional score accurately characterizes the government's capacity and intent, distinct from the outcomes it cannot control.
Emerging risks
Forward-looking risk signals from the Apr 16 research cycle. These are not current findings — they are early warning flags.
EU AI Act high-risk enforcement begins August 2, 2026 — 108 days from tonight. All four major AI labs assessed this week (Mistral, Anthropic, OpenAI, xAI) face significant exposure. Mistral's CSAM safety failures are directly relevant to EU AI Act prohibited practices. The Act's enforcement regime will generate primary-source regulatory evidence that the scanner should treat as high-priority. Entities to watch: Mistral AI, OpenAI, Figure AI, Tesla Optimus.
AI Labs safety governance is now a US national security issue. The Pentagon pressure on Anthropic and the Defense Secretary's explicit threat of blacklisting introduce a new dynamic: US government agencies are actively influencing AI lab safety policy, and the direction of that influence appears to be toward loosening safety constraints. This creates a new category of institutional pressure the benchmark has not previously tracked. The scanner should monitor DoD AI procurement announcements for entities in the AI Labs index.
OpenAI's new stalking lawsuit (April 10, 2026) post-dates the applied score. The court order to cut off ChatGPT access for a dangerous user (April 13) and the underlying lawsuit documenting that OpenAI ignored 3 internal safety flags — including its own mass-casualty weapons flag — represent new evidence not captured in the applied score of 40.6. An OpenAI reassessment should be considered within 30-60 days.
UnitedHealth Group is heading toward a structural governance collapse. Three CEOs in 12 months, active criminal investigation, antitrust trial, RICO suit, and $50B market cap erosion represent the conditions under which companies historically either reform structurally or dissolve. The proposed score of 10.9 may be accurate for the current moment but could move in either direction sharply — upward if governance reforms materialize, further downward if criminal charges are filed. High volatility watch.
China unassessed with major new evidence. The scanner flagged China (priority score 57) but it was not assessed tonight. EU-US-Canada-UK coordinated Xinjiang sanctions (March 2026), EU sanctions against Chinese companies for Russia support (February 2026), and the ongoing SCO/UN Security Council dynamics all represent fresh evidence against a published score of 15.3. China should be a next-cycle priority.
US government's foreign aid cuts may require a US country assessment. The scan flagged the United States (priority score 57) with data suggesting DOGE-linked foreign aid cuts caused an estimated 781,343 deaths globally by February 2026 (mostly children). This is an extraordinary harm signal that the benchmark has not previously assessed against a developed democracy. A US country assessment would be methodologically challenging but the evidence base is substantial.
Research insights
Analytical observations from the Apr 16 research cycle. These are assessor-level interpretations, not findings.
Two nights into the pipeline, every assessed AI lab has received a downgrade proposal or confirmation of a low score. The pattern now spans five entities across two nights: Mistral (-29.5 proposed), Anthropic (-22.1 proposed), OpenAI (-20.2 applied), xAI (-16.1 applied), and Unitree (confirmed). Only Anthropic retains a score above 50. The AI Labs index as published appears to reflect a 2023-2024 world of safety commitments that the 2025-2026 world has systematically dismantled. A comprehensive re-assessment of the entire index within 60 days should be considered.
The benchmark is detecting a structural shift in how large institutions respond to external accountability pressure. Across all sectors assessed in two nights, the defining pattern is not that entities are unaware of harm or lack capacity to respond — it is that accountability responses are activated only under legal, regulatory, or criminal compulsion, never voluntarily. Boeing knew. J&J knew. OpenAI's own systems flagged the dangerous user. Anthropic's head of safeguards resigned to signal the problem. Rwanda signed accords it had no intention of honoring. The harm awareness exists; the institutional will to act on it is what is absent. This finding should inform how the benchmark communicates about its scores publicly.
Exxon Mobil's confirmation is the most analytically precise result of the night. Seven of eight dimensions scored exactly at 6.3/100 — a coherent, evidence-based finding that this company's behavior across multiple decades, multiple facilities, multiple legal proceedings, and multiple categories of harm (air pollution, cancer, climate deception, plastic pollution) is uniformly and consistently at the Critical floor. No dimension shows exception. The published score of 9.1 is modestly overstated but the Critical band characterization is accurate. Exxon is the clearest case in the Fortune 500 index of a company whose institutional behavior is structurally and durably at the floor.
Ukraine's score is the only one in tonight's assessments that reflects suffering caused by an external actor, not the assessed entity's own conduct. A score of 46.9 for a government that has lost 70% of its electricity generation capacity and seen 49% more civilians killed in the past month reflects a different kind of assessment challenge: how do you score an institution whose compassion failures are primarily a function of what is being done to it rather than what it is choosing to do? The current methodology handles this correctly — Ukraine's government receives credit for maintaining institutions, cooperation with monitoring, and reform under pressure. But the score's limitations as a signal of population wellbeing should be stated explicitly.
Published scores systematically overstate entities that have strong self-reporting infrastructure. Three of the six entities receiving downgrade proposals tonight (Mistral, Anthropic, Alphabet) have extensive public safety frameworks, ESG reports, and institutional communications. Rwanda publishes development metrics. UnitedHealth publishes sustainability reports. In each case, the published score reflects that institutional self-reporting positively, while primary-source evidence from regulatory action, litigation, investigative journalism, and independent testing tells a materially worse story. This is the fundamental methodological challenge the benchmark exists to address — and two nights of assessment have confirmed it is pervasive.
Assessed entities
All entities assessed in tonight's research cycle, with composite scores.
| Entity | Composite score |
|---|---|
| Alphabet (Google) | 42.2 |
| Anthropic | 68.8 |
| Exxon Mobil | 7.8 |
| Iran | 1.6 |
| Mistral AI | 46.9 |
| Myanmar | 0.0 |
| Rwanda | 30.0 |
| Ukraine | 46.9 |
| UnitedHealth Group | 10.9 |
| Walmart | 28.9 |

Want the complete picture?
Full benchmark reports include all 40 subdimension scores, complete evidence trails, and methodology documentation for every assessed entity.