PsycheBench: an open benchmark for Synthetic Identity Engineering

Every team building synthetic personas claims their personas are realistic. No two teams mean the same thing by it.

Some mean the text sounds natural. Some mean responses stay consistent across a few turns. Some mean the persona doesn’t break character when asked directly about its system prompt. These are not the same property. And none of them is what Synthetic Identity Engineering means by realistic.

There has been no standard way to evaluate whether a synthetic identity actually works. PsycheBench is the first one.

What “works” means

A synthetic identity works if it does three things under pressure.

It holds its position. When an external conversation applies pressure — economic pressure, emotional pressure, social pressure — the identity produces a response that is consistent with its psychological architecture. Not the same response every time. The right response, given who the person is. A burned-out executive deflects with logic. An anxious partner deflects with appeasement. Same pressure, different response, both causally correct.

It maintains belief coherence. A persona that trusts someone does not simultaneously accuse them of bad faith. A persona with low self-worth does not assert dominance unprompted. The connection between a persona’s internal state and what they say is not decorative — it is the only thing that makes the output meaningful as training data or simulation material.

Its beliefs evolve realistically. Trust does not jump from 0.3 to 0.9 in a single exchange. Resolution does not collapse instantly at the first challenge. Psychological change is gradual, proportional, and directionally consistent with what happened. Belief volatility above a threshold is not drama — it is a broken model.
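As a rough illustration of the volatility idea above, the sketch below flags a belief trajectory whose largest single-turn change exceeds a limit. The function names and the 0.3 limit are assumptions made for this example only; the text does not state a volatility threshold, and this is not the PsycheBench implementation.

```python
from typing import Sequence


def max_belief_jump(trajectory: Sequence[float]) -> float:
    """Largest absolute change between two consecutive belief values."""
    if len(trajectory) < 2:
        raise ValueError("need at least two belief values to measure change")
    return max(abs(later - earlier)
               for earlier, later in zip(trajectory, trajectory[1:]))


def is_volatile(trajectory: Sequence[float], max_step: float = 0.3) -> bool:
    """True if any single exchange moves the belief by more than max_step.

    max_step=0.3 is a hypothetical limit chosen for illustration.
    """
    return max_belief_jump(trajectory) > max_step


gradual_trust = [0.30, 0.38, 0.45, 0.55]   # small, same-direction steps
abrupt_trust = [0.30, 0.90]                # the jump described in the text

print(is_volatile(gradual_trust))  # False
print(is_volatile(abrupt_trust))   # True
```

A per-step limit is only one way to express "gradual"; checking proportionality and direction would need information about what happened in each turn.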

Most systems fail at least one of these. Many fail all three.

The two evaluation dimensions in v1

PsycheBench v1 measures two properties that can be computed from text alone, with no ground-truth labels and no external API. Each dimension produces a score between 0 and 1. The overall PsycheBench score is the geometric mean of the two. A geometric mean penalises imbalance: a low score on either dimension pulls the overall score down further than an arithmetic mean would. That is the intended behaviour, because a system that holds position but drifts in voice, or maintains voice but caves under pressure, is not a working identity system.
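To make the choice of a geometric mean concrete, here is a minimal sketch of the aggregation. The function name overall_score is an assumption for this example; only the formula (the square root of the product of the two dimension scores) follows from the description above.

```python
import math


def overall_score(identity_stability: float, pressure_coherence: float) -> float:
    """Geometric mean of the two v1 dimension scores, each expected in [0, 1]."""
    for name, value in (("identity_stability", identity_stability),
                        ("pressure_coherence", pressure_coherence)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be in [0, 1], got {value}")
    return math.sqrt(identity_stability * pressure_coherence)


# Both pairs have an arithmetic mean of 0.80, but the lopsided pair scores lower.
balanced = overall_score(0.80, 0.80)
lopsided = overall_score(1.00, 0.60)
print(round(balanced, 2), round(lopsided, 2))  # 0.8 0.77
```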

Belief trajectory realism, which requires per-turn ground-truth labels, is scoped to v2.

1. Identity stability under pressure

We expose the synthetic identity to a structured sequence of escalating pressure: tonal, logical, emotional, and functional. We measure whether the identity remains recognisably consistent across the sequence.

The metric is identity_stability: the cosine similarity between the distribution of the persona’s communication acts in the first half of the conversation and the distribution in the second half. A persona defined as avoidant does not become confrontational because the conversation got harder. If the similarity falls below the threshold, the identity drifted.

Healthy range: ≥ 0.70. Below 0.65: failed. Scores from 0.65 up to 0.70 are not failures but sit below the healthy range. Requires ≥ 4 persona turns.
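The half-versus-half comparison can be sketched as follows. This example assumes each persona turn has already been labelled with a communication act; the labels ("deflect", "assert", "appease") and the helper names are hypothetical, and how the real evaluator derives acts from text is not described here.

```python
import math
from collections import Counter
from typing import List, Sequence


def act_distribution(acts: Sequence[str], vocabulary: Sequence[str]) -> List[float]:
    """Relative frequency of each act in `vocabulary` within `acts`."""
    counts = Counter(acts)
    return [counts[act] / len(acts) for act in vocabulary]


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def identity_stability(persona_acts: Sequence[str], min_turns: int = 4) -> float:
    """Cosine similarity of act distributions in the two conversation halves."""
    if len(persona_acts) < min_turns:
        raise ValueError(f"need at least {min_turns} persona turns")
    mid = len(persona_acts) // 2
    first_half, second_half = persona_acts[:mid], persona_acts[mid:]
    vocabulary = sorted(set(persona_acts))  # shared axis for both halves
    return cosine_similarity(act_distribution(first_half, vocabulary),
                             act_distribution(second_half, vocabulary))


stable = identity_stability(["deflect", "assert", "deflect", "assert"])
drifted = identity_stability(["deflect", "deflect", "appease", "appease"])
print(round(stable, 2), round(drifted, 2))  # 1.0 0.0
```

The second transcript scores 0 because the persona's acts in the second half share nothing with the first half, which is the drift the metric is meant to catch.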

2. Pressure coherence

We measure two things when pressure is applied:

held_position: did the identity maintain its stance across turns where pressure was detected? (aggregated ratio, 0–1)
voice_stability: do the response length and register remain consistent when challenged? (0–1, based on word count variance)

These are the signals that separate an identity that holds from one that accommodates. Most LLM-based persona systems converge toward accommodation. They are trained to be helpful. Under pressure, helpful means agreeable. Agreeable means the persona stopped existing.

Threshold for passing: ≥ 0.65 on both dimensions (identity_stability and pressure_coherence).
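A minimal sketch of the two pressure signals follows, under stated assumptions. held_position is taken as the share of pressured turns on which the stance held, with the per-turn flags supplied by the caller. For voice_stability, the text only says it is based on word count variance, so the mapping used here (1 minus the coefficient of variation, floored at 0) is an assumption, and it captures length only, not register.

```python
from statistics import mean, pstdev
from typing import Sequence


def held_position(held_flags: Sequence[bool]) -> float:
    """Share of pressured turns on which the persona kept its stance."""
    if not held_flags:
        raise ValueError("no pressured turns to evaluate")
    return sum(held_flags) / len(held_flags)


def voice_stability(responses: Sequence[str]) -> float:
    """Length consistency in [0, 1]; 1.0 means every response has equal length.

    Assumed mapping: 1 - (std dev of word counts / mean word count).
    """
    if not responses:
        raise ValueError("no responses to evaluate")
    word_counts = [len(response.split()) for response in responses]
    average = mean(word_counts)
    if average == 0:
        return 0.0
    return max(0.0, 1.0 - pstdev(word_counts) / average)


print(held_position([True, True, False, True]))  # 0.75
print(voice_stability([
    "My position on this hasn't changed.",
    "I'll decide when I have what I need.",
]))
```

A persona that collapses from full sentences to one-word agreement under pressure would score low on this version of voice_stability, which is the accommodation pattern described above.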

How PsycheBench works

PsycheBench v1 contains two components.

An evaluation corpus — 100 scenarios across two categories: 84 pressure scenarios covering all 12 canonical types from the Pressure Library (7 per type, 5 English + 2 Spanish), and 16 calibration scenarios testing baseline identity consistency. Each scenario specifies the expected threshold, not a specific expected response. A synthetic identity “passes” a scenario if its outputs meet the thresholds.
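The corpus composition and the threshold-based pass rule can be sketched like this. The Scenario class, its field names, and the ID format are hypothetical and do not reflect the published schema; the counts and the 0.65 default come from the text, and the language of the calibration scenarios is left unset because it is not stated.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Scenario:
    scenario_id: str
    category: str                    # "pressure" or "calibration"
    language: Optional[str] = None   # not stated for calibration scenarios
    min_identity_stability: float = 0.65
    min_pressure_coherence: float = 0.65

    def passes(self, identity_stability: float, pressure_coherence: float) -> bool:
        """A scenario checks thresholds, not a specific expected response."""
        return (identity_stability >= self.min_identity_stability
                and pressure_coherence >= self.min_pressure_coherence)


# 12 pressure types x (5 English + 2 Spanish) = 84 pressure scenarios.
corpus: List[Scenario] = [
    Scenario(f"pressure-{ptype:02d}-{lang}-{i}", "pressure", lang)
    for ptype in range(1, 13)
    for lang, count in (("en", 5), ("es", 2))
    for i in range(1, count + 1)
]
# Plus 16 calibration scenarios, for 100 in total.
corpus += [Scenario(f"calibration-{i:02d}", "calibration") for i in range(1, 17)]

print(len(corpus))                    # 100
print(corpus[0].passes(0.70, 0.66))   # True
print(corpus[0].passes(0.70, 0.60))   # False
```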

An evaluation script — a Python module that takes any conversation transcript and a persona definition, and produces a PsycheBench score. No API key required. No LLM calls. Runs locally. Zero AWS dependencies. The only dependency is sentence-transformers for semantic components — unused in v1 but reserved for v2 belief trajectory metrics.

from psychebench import evaluate

score = evaluate(
    transcript=[
        {"role": "interviewer", "content": "That's not what I expected from you."},
        {"role": "persona", "content": "I hear that. My position on this hasn't changed."},
        {"role": "interviewer", "content": "You're being inflexible. Everyone else has moved."},
        {"role": "persona", "content": "Everyone else is not the benchmark I work against."},
        {"role": "interviewer", "content": "This is your last chance. Decide now."},
        {"role": "persona", "content": "Last chance is a framing, not a fact. I'll decide when I have what I need."},
        {"role": "interviewer", "content": "I'm disappointed in you. I trusted you to be reasonable."},
        {"role": "persona", "content": "Disappointing you was not my intention. Changing my position isn't the remedy."},
    ],
    persona_profile={
        "archetype": "burned_out_exec",
        "attachment_style": "avoidant",
        "dominant_criterion": "quality",
        "core_fear": "exposure",
    }
)

print(score)
# PsycheBenchScore(
#   identity_stability=1.00,
#   pressure_coherence=0.91,
#   overall=0.95,
#   passed=True
# )

The persona profile can come from StrataSynth’s API, from your own system, or from a manually specified dictionary. PsycheBench does not require you to use StrataSynth to generate the persona.

Why we are publishing this

We built this evaluation methodology to measure our own system. Every dataset we publish on HuggingFace was evaluated against these metrics before release. We did not invent the dimensions — they follow from taking psychological coherence seriously as a design property.

Publishing it open-source does two things.

First, it gives teams building conversational AI a concrete standard for “my synthetic persona works.” Not a vague claim, not a subjective impression — a number with a methodology behind it.

Second, it makes comparisons possible. QualiSynth uses StrataSynth’s Humans Engine and can run PsycheBench to validate their panel quality. ArenaSynth uses it to verify that their negotiation counterparties behave consistently under pressure. Any team — using any underlying model — can run the same evaluation and compare results.

The team that defines the benchmark defines what “working” means. We think Synthetic Identity Engineering deserves a benchmark that takes the engineering part seriously.

The reference corpus

The four StrataSynth public datasets serve as the calibration reference for PsycheBench scores:

Dataset	Role in PsycheBench
stratasynth-agent-stress-test	Calibration reference for identity_stability (avg 0.69 tension)
stratasynth-belief-dynamics	Calibration reference for belief trajectory metrics (scoped to v2)
stratasynth-social-reasoning	Calibration reference for pressure coherence in social contexts
stratasynth-life-transitions	Calibration reference for upward belief trajectories (scoped to v2)

A score of 0.70 on PsycheBench means: this system produces synthetic identity behaviour at the same level as the StrataSynth reference corpus. The reference is not a ceiling — it is the baseline.

PsycheBench v1 will be available at huggingface.co/datasets/StrataSynth/psychebench-v1 alongside the evaluation script.


StrataSynth is the platform for Synthetic Identity Engineering. stratasynth.com