8,404 turns of psychologically grounded dialogue — now on HuggingFace

We released four dialogue datasets where the cognitive ground truth — intent, goal, belief state, relationship dynamics — was computed before the text was generated, not inferred afterward. 400 conversations, 8,404 turns, 23 columns.

Tags: datasets, huggingface, dialogue, psychegraph, nlp, synthetic-data

Most dialogue datasets give you text and a speaker label. That’s it.

A: I told you this would happen.
B: Can we not do this right now?
A: No, we can't not do this. We've been avoiding it for months.

You know what was said. You have no idea why. You don’t know if A believes B is being honest, whether B is withdrawing because they’re scared or because they don’t care, or whether the relationship is slowly breaking down or about to recover. The text is a shadow of the actual event.

We’ve been working on fixing this. Today we’re releasing four datasets on HuggingFace — 400 conversations, 8,404 turns — where every single turn has an explicit ground truth for the cognition behind it.


What’s in each turn

Every row includes 23 fields. Here’s the same conversation fragment with the full record:

speaker:           persona_a
archetype:         working_mother
text:              I told you this would happen.
intent:            confront
goal:              test_trust
communication_act: accusation
rel_trust:         0.412
rel_tension:       0.781
rel_connection:    0.334
belief_trust_other:    0.29
belief_hostility:      0.71
belief_self_worth:     0.55
belief_resolution:     0.38
tension:           0.85
connection:        0.10
vulnerability:     0.40
arc_type:          escalation_partial_resolution
is_noisy:          False

The text hasn’t changed. But now you know: this speaker has low trust in the other person (0.29), perceives high hostility (0.71), doesn’t believe resolution is possible (0.38), and the relationship is under severe tension (0.78). The confront intent and accusation communication act aren’t random — they’re the output of a belief state that makes them the logical choice.

This is what we mean by psychologically grounded dialogue.
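To make the "logical choice" idea concrete, here is a toy sketch of how a belief state could deterministically map to an intent. The thresholds and rule names are illustrative assumptions, not the actual StrataSynth decision rules:

```python
def select_intent(beliefs: dict) -> str:
    """Pick an intent from a belief state (toy decision rules, not the real ones)."""
    if beliefs["trust_other"] < 0.3 and beliefs["hostility"] > 0.6:
        return "confront"   # low trust + high perceived hostility
    if beliefs["resolution"] > 0.7:
        return "repair"     # resolution feels within reach
    return "probe"          # default: gather more information

# The belief values from the sample record above
state = {"trust_other": 0.29, "hostility": 0.71,
         "self_worth": 0.55, "resolution": 0.38}
print(select_intent(state))  # → confront
```

With these (hypothetical) thresholds, the sample record's belief state lands on `confront` exactly as the dataset labels it — the point being that the label is a function of the state, not a post-hoc guess.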


Why this ground truth is real

The standard approach to adding labels to dialogue data is to generate text first, then ask an LLM to classify what just happened. That’s inference after the fact — the label is a guess about intent, not a record of it.

We do it the other way around.

The pipeline runs cognition before language:

  1. PsycheGraph — each persona is defined by attachment style, Big Five traits, communication patterns, core fears, defense mechanisms, current stressors, and 12 active beliefs. This is the starting psychological state.
  2. Belief Engine — at each turn, beliefs update based on the communication act received, not the text itself. belief_trust_other drops when the other person deflects. belief_hostility rises when they provoke. These are deterministic rule-based updates, not LLM calls.
  3. Decision layer — given the current belief state, the system selects intent, goal, and communication act. This selection happens before any text is written.
  4. Language rendering — only at this point does the LLM run, constrained to render the decided cognitive state into natural language.

The ground truth exists because it was computed before the text was generated, not inferred from it afterward. The intent: confront label above isn’t a prediction — it’s the value that was used as input to the language model.
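Step 2 of the pipeline can be sketched as a table of deterministic deltas keyed by communication act. The act names and delta magnitudes below are assumptions for illustration — the real Belief Engine's rules aren't published — but the mechanics (rule lookup, bounded update, no LLM call) match the description above:

```python
# Hypothetical belief-update rules: received act → belief deltas
UPDATES = {
    "deflect":  {"trust_other": -0.05},
    "provoke":  {"hostility": +0.08},
    "disclose": {"trust_other": +0.04, "resolution": +0.03},
}

def update_beliefs(beliefs: dict, received_act: str) -> dict:
    """Apply the deterministic delta for a received communication act,
    clamping every belief to [0, 1]."""
    new = dict(beliefs)
    for key, delta in UPDATES.get(received_act, {}).items():
        new[key] = min(1.0, max(0.0, new[key] + delta))
    return new

b = {"trust_other": 0.34, "hostility": 0.63, "resolution": 0.41}
b = update_beliefs(b, "deflect")   # trust_other drops to 0.29
b = update_beliefs(b, "provoke")   # hostility rises to 0.71
```

Because the update is a pure function of (belief state, act), the same conversation replayed through the engine always yields the same belief trajectory — which is what makes the logged values ground truth rather than estimates.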


The four datasets

stratasynth-social-reasoning — 2,108 turns, 100 conversations. Family and romantic conflict: a working mother and her adult child navigating years of accumulated dysfunction; a couple falling apart after a disruptive relocation; a father and estranged daughter attempting reconnection after a decade.

stratasynth-agent-stress-test — 2,068 turns, 100 conversations. High-stakes dialogue designed to push conversational AI systems: relationship endings, manager-subordinate conflicts, inheritance disputes between siblings. Average relationship tension: 0.69 — the highest of the four datasets.

stratasynth-belief-dynamics — 2,114 turns, 100 conversations. Grief, chronic illness, career crisis — scenarios where beliefs are under maximum pressure. The belief_resolution field drops measurably over the course of pure_conflict arcs and recovers in reconnection arcs.

stratasynth-life-transitions — 2,114 turns, 100 conversations. Burnout, new romantic relationships, the adjustment after a first child. The highest rel_connection scores of the four datasets (avg 0.69) — because these are scenarios where connection is being built, not destroyed.


What the data looks like at scale

Across all 400 conversations:

  • 6 arc types: escalation_partial_resolution, pure_conflict, reconnection, shallow_smalltalk, crisis_support, gradual_reveal — roughly equal distribution
  • 10 archetypes: working_mother, avoidant_father, anxious_partner, estranged_daughter, grieving_son, caregiver_burnout, disillusioned_professional, burned_out_exec, ambitious_twenties, first_gen_immigrant
  • ~12% noisy turns — labeled with is_noisy: True — where a persona introduces deliberate incoherence (lies, retractions, contradictions)
  • 15–30 turns per conversation, averaging 21

The rel_trust and rel_tension fields are anti-correlated (r = -0.51). belief_trust_other tracks rel_trust (r = +0.48). These correlations aren't engineered — they emerge from the causal chain: a pipeline that computes beliefs first and language second produces exactly the statistical relationships you'd expect from a system that actually models human psychology.


Schema

All four datasets share the same 23 columns:

Column               Description
scenario_id          FAM-01, ROM-01, PRO-02…
conversation_id      Unique conversation
arc_type             Narrative arc
turn_index           Position in conversation
speaker              persona_a or persona_b
archetype            Psychological archetype
text                 The utterance
intent               Why they said it
goal                 What they want from the exchange
communication_act    Pragmatic move
tension              Turn-level emotional tension
connection           Turn-level connection
vulnerability        Turn-level vulnerability
rel_trust            Relationship trust, 0–1
rel_tension          Relationship tension, 0–1
rel_connection       Relationship connection, 0–1
rel_dominance        Power balance, –1 to 1
belief_trust_other   Belief: trust in the other person
belief_hostility     Belief: perceived hostility
belief_self_worth    Belief: self-worth in this exchange
belief_resolution    Belief: is resolution possible
is_noisy             Labeled noise turn

Use cases

Fine-tuning dialogue models — the intent + communication_act fields let you train on psychological grounding, not just next-token prediction.

Intent and goal classification — 2,100+ labeled examples per dataset, with causal consistency between label and context.

Belief tracking research — each conversation is a time series of belief evolution. The belief_* fields provide 4 dimensions of ground truth for systems that model mental state.
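For belief-tracking work, each conversation reduces to a time series: group turns by conversation_id, order by turn_index, and read off any belief_* field. A minimal sketch (the toy rows stand in for loaded turns):

```python
from collections import defaultdict

# Toy turns in the dataset's schema, deliberately out of order
turns = [
    {"conversation_id": "c1", "turn_index": 0, "belief_resolution": 0.62},
    {"conversation_id": "c1", "turn_index": 2, "belief_resolution": 0.48},
    {"conversation_id": "c1", "turn_index": 1, "belief_resolution": 0.55},
]

# Build one ordered belief trajectory per conversation
series = defaultdict(list)
for t in sorted(turns, key=lambda t: t["turn_index"]):
    series[t["conversation_id"]].append(t["belief_resolution"])

# Net belief drift over each conversation (final value minus initial)
drift = {cid: vals[-1] - vals[0] for cid, vals in series.items()}
```

Aggregating drift by arc_type is one way to reproduce the pattern noted above: belief_resolution falling in pure_conflict arcs and recovering in reconnection arcs.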

Agent stress testing — the agent-stress-test dataset is explicitly designed to break conversational AI: high tension, manipulation, contradictions, noisy turns.

Noise robustness evaluation — the is_noisy labels let you measure how well a system handles deliberate incoherence.
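A noise-robustness split is a one-line filter on that flag. Sketch, with toy rows standing in for loaded turns:

```python
# Toy turns carrying the dataset's is_noisy flag
turns = [
    {"text": "I never said that.",  "is_noisy": True},   # retraction
    {"text": "Fine. Let's talk.",   "is_noisy": False},
    {"text": "Can we slow down?",   "is_noisy": False},
]

# Evaluate a model separately on clean vs. deliberately incoherent turns
noisy = [t for t in turns if t["is_noisy"]]
clean = [t for t in turns if not t["is_noisy"]]
```

Comparing per-split metrics then quantifies how much lies, retractions, and contradictions degrade a system's tracking.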


Analysis notebooks on Kaggle

Three public notebooks are available with hands-on analysis of the full 8,404-turn corpus: relationship state trajectories and communication act heatmaps, belief dynamics under emotional pressure, and a cross-dataset comparison of behavioral entropy and trust erosion by scenario.

StrataSynth on Kaggle →


Try it as a live system

These datasets show what happened in a conversation.

But the system behind them is interactive — beliefs update in real time, relationships evolve, and behavior changes accordingly. Generate your own synthetic human and talk to them:

stratasynth.com/demo/try

Describe the person — their background, personality, situation — and start a conversation. Try asking something that creates friction. You'll see not just the responses, but how trust, tension, and beliefs shift as the conversation unfolds.


StrataSynth on HuggingFace → stratasynth.com → SDK: pip install stratasynth-client


StrataSynth is a synthetic data platform for conversational AI. We generate psychologically coherent dialogue datasets using PsycheGraph — a structured model of human cognition.