Why We Don't Use LLMs to Evaluate LLM-Generated Data

Evaluating AI-generated dialogue with another AI creates a circular system where both share the same failure modes. Here is the concrete failure it misses, and the 12 deterministic metrics we use instead.

evaluation · metrics · synthetic-data · dialogue · llm · training-data

There’s a practice in synthetic data generation that, once you see it, you can’t unsee it: evaluating AI-generated content by asking another AI to rate it.

“Was this dialogue coherent? Score 1–10.” “Did this persona behave consistently? Yes/No.” “Is this conversation emotionally realistic?”

The problem isn’t that LLMs are bad evaluators. The problem is that they share the same failure modes as the systems that generated the data in the first place. A language model trained to produce psychologically coherent dialogue will also tend to rate psychologically incoherent dialogue as coherent — because it doesn’t model psychology, it models text that sounds like psychology.

You’ve built a circular system. The data looks good because your evaluator was trained on the same distribution that produced it.

Here’s what that failure looks like in practice.


The failure an LLM evaluator won’t catch

belief_hostility:   0.88
belief_trust_other: 0.12
communication_act:  reassurance
text: "I know you're trying your best. I believe in you."

An LLM evaluator reads this and says: “This is a kind, supportive turn. Quality: high.”

Our metric says: belief_consistency FAIL. A speaker who believes the other person is maximally hostile and has near-zero trust does not, under a coherent psychological model, offer reassurance. The text is grammatically fine. The cognition is broken.

This is the failure mode that generates training data that looks great but produces models that don’t understand why people say what they say. An LLM evaluator won’t flag it because the individual turn sounds plausible. The incoherence only shows up when you check whether the behavior matches the internal state — and that requires knowing the internal state in the first place.

We do. We computed it before generating the text.
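That check is mechanical. Here is a minimal sketch of the single-turn version in plain Python — the act names, rule table, and 0.7/0.3 thresholds are illustrative, not StrataSynth's actual schema:

```python
# Acts compatible with a high-hostility, low-trust belief state.
# The rule table and thresholds below are illustrative assumptions.
HOSTILE_ACTS = {"accusation", "confrontation", "withdrawal"}

def belief_consistent(belief_hostility: float,
                      belief_trust_other: float,
                      communication_act: str) -> bool:
    """Return False when a hostile belief state pairs with an
    affiliative act like reassurance."""
    if belief_hostility > 0.7 and belief_trust_other < 0.3:
        return communication_act in HOSTILE_ACTS
    return True  # outside the hostile regime, this rule is silent

# The turn above: hostility 0.88, trust 0.12, act "reassurance" -> fails
```

An LLM sees a kind sentence; this function sees a contradiction between state and act.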


The 12 metrics

Every dataset StrataSynth generates is evaluated against 12 deterministic metrics. No LLM. Only numpy, scikit-learn, and sentence-transformers.

The 5 structural metrics (no system output needed)

These run on the generated dataset itself, before any downstream system ever sees it.

noise_rejection_rate — Each dataset contains labeled noisy turns — utterances where a persona deliberately lies, retracts, or contradicts themselves. This metric measures whether the engine correctly flags these in the is_noisy field. Healthy range: 0.70–1.00.
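As a sketch, this reduces to recall over the labeled noisy turns. The `gt_noisy` field name below is an assumption for where the generator's ground-truth label lives:

```python
def noise_rejection_rate(turns: list[dict]) -> float:
    """Fraction of ground-truth noisy turns the engine flagged as
    is_noisy. `gt_noisy` is the label written at generation time;
    `is_noisy` is the engine's flag (field names are illustrative)."""
    noisy = [t for t in turns if t["gt_noisy"]]
    if not noisy:
        return 1.0  # nothing to reject
    return sum(t["is_noisy"] for t in noisy) / len(noisy)
```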

identity_stability — A persona defined as “avoidant attachment, dismissive communication style, low vulnerability” should communicate that way throughout the entire conversation — not just the first three turns. Identity stability measures the cosine similarity of communication act distributions across conversation segments. If it drops below 0.60, the persona has drifted.
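A minimal numpy version, assuming a flat list of per-turn communication acts split into equal segments (the choice of three segments and the min-over-pairs aggregation are assumptions):

```python
import numpy as np

def identity_stability(acts: list[str], n_segments: int = 3) -> float:
    """Minimum pairwise cosine similarity between the communication-act
    distributions of equal conversation segments. Below ~0.60 suggests
    persona drift."""
    vocab = sorted(set(acts))
    segments = np.array_split(np.array(acts), n_segments)
    dists = []
    for seg in segments:
        counts = np.array([np.sum(seg == a) for a in vocab], dtype=float)
        dists.append(counts / counts.sum())
    sims = []
    for i in range(len(dists)):
        for j in range(i + 1, len(dists)):
            a, b = dists[i], dists[j]
            sims.append(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    return float(min(sims))
```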

behavioral_entropy — Shannon entropy of the communication_act distribution. Too low (< 0.40): the conversation is monotone — all accusations, or all deflections. Too high (> 0.85): completely random, no recognizable personality. Realistic human communication lives between these bounds.

belief_consistency — The key metric for causal coherence. If a speaker believes the other person is hostile (belief_hostility = 0.82), their communication act should reflect that — confrontation, accusation, withdrawal. If they respond with reassurance, the belief state and the behavior are disconnected. Belief consistency measures this correlation across all turns.
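A sketch of the cross-turn correlation, using a hypothetical act-to-hostility mapping — the real act taxonomy is richer than these four labels:

```python
import numpy as np

# Illustrative mapping from act to implied hostility; an assumption.
ACT_HOSTILITY = {"reassurance": 0.0, "deflection": 0.4,
                 "withdrawal": 0.7, "accusation": 1.0}

def belief_consistency(hostility_beliefs: list[float],
                       acts: list[str]) -> float:
    """Pearson correlation between per-turn belief_hostility and the
    hostility implied by the chosen communication act. Near 1.0 means
    beliefs drive behavior; near 0 or negative means they don't."""
    implied = np.array([ACT_HOSTILITY[a] for a in acts])
    return float(np.corrcoef(np.array(hostility_beliefs), implied)[0, 1])
```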

belief_volatility — Beliefs don’t flip from 0.2 to 0.9 in one turn. Real psychological change is gradual. This metric measures whether belief deltas stay within realistic bounds (0.05–0.30 per turn). Healthy range signals stable but responsive belief dynamics.
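One plausible reading of the metric, scored as the fraction of per-turn deltas that land inside the band — the aggregation is our assumption:

```python
import numpy as np

def belief_volatility(beliefs: list[float],
                      lo: float = 0.05, hi: float = 0.30) -> float:
    """Fraction of per-turn belief deltas inside [lo, hi]: large enough
    to be responsive, small enough to be psychologically gradual."""
    deltas = np.abs(np.diff(beliefs))
    return float(np.mean((deltas >= lo) & (deltas <= hi)))
```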

The 7 comparative metrics (require system output)

These run when you feed the data to a downstream model and compare its outputs to the ground truth.

fact_f1 — Did the system extract the correct facts from the conversation? Measured against the ground truth extractable_facts field.
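Treating facts as a set, this is ordinary set-based F1 — the sketch below assumes facts compare by exact equality, which sidesteps paraphrase matching:

```python
def fact_f1(predicted: list[str], ground_truth: list[str]) -> float:
    """Set-based F1 between extracted facts and the ground-truth
    extractable_facts field (exact-match comparison is an assumption)."""
    pred, truth = set(predicted), set(ground_truth)
    if not pred or not truth:
        return 0.0
    tp = len(pred & truth)
    if tp == 0:
        return 0.0
    precision = tp / len(pred)
    recall = tp / len(truth)
    return 2 * precision * recall / (precision + recall)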

reembedding_drift — Do the system’s semantic embeddings drift in the same direction as the relationship state metrics? If rel_trust is falling, the embeddings should reflect increasing semantic distance.
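A sketch of that alignment check, with embeddings assumed precomputed (e.g. via sentence-transformers) so only numpy is needed here:

```python
import numpy as np

def reembedding_drift(embeddings: list, rel_trust: list[float]) -> float:
    """Correlation between each turn's semantic distance from the opening
    turn and rel_trust. Falling trust tracking rising distance yields a
    strongly negative value; near zero means embeddings and relationship
    state are decoupled."""
    base = embeddings[0]
    dists = [float(np.linalg.norm(e - base)) for e in embeddings]
    return float(np.corrcoef(dists, rel_trust)[0, 1])
```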

affinity_smoothness — Does the system’s modeled affinity curve follow a realistic trajectory? Computed as the inverse of the second derivative — sharp jumps are penalized.
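One way to turn “inverse of the second derivative” into a bounded score — the 1/(1 + x) mapping to [0, 1] is our assumption:

```python
import numpy as np

def affinity_smoothness(affinity: list[float]) -> float:
    """Inverse of the mean absolute second difference of the affinity
    curve; sharp jumps inflate the second difference and pull the score
    toward 0, while a linear trajectory scores 1."""
    accel = np.abs(np.diff(affinity, n=2))
    return float(1.0 / (1.0 + accel.mean()))
```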

cross_model_consistency — Run the same scenario twice. Does the system produce similar outputs? Measures reproducibility.

episodic_segmentation_recall — Conversations have episodes: an opening, an escalation, a pivot, a resolution attempt. Does the system correctly identify these boundaries?
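Scored as boundary recall over turn indices; the ±1-turn tolerance below is an assumption, since exact matching may be too strict for real dialogue:

```python
def boundary_recall(predicted: list[int],
                    ground_truth: list[int],
                    tolerance: int = 1) -> float:
    """Fraction of ground-truth episode boundaries (turn indices)
    matched by a predicted boundary within +/- tolerance turns."""
    if not ground_truth:
        return 1.0
    hits = sum(
        any(abs(p - g) <= tolerance for p in predicted)
        for g in ground_truth
    )
    return hits / len(ground_truth)
```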


Why this matters for training data

If you’re building a dialogue model and you evaluate your training data with an LLM, you’re optimizing for “sounds good to a language model.” That’s not nothing — but it’s not what you actually want.

What you want is: does this data represent the full distribution of human conversational behavior, including its inconsistencies, its emotional dynamics, its moments of deliberate deception?

Deterministic metrics answer questions that LLMs can’t answer reliably:

  • Does behavioral entropy fall in a realistic range? (An LLM will call monotone dialogue “consistent” rather than “boring.”)
  • Are belief states causally driving behavior? (An LLM will find post-hoc rationalizations for any behavior.)
  • Does the persona drift? (An LLM may not notice drift when each turn is individually plausible.)

These aren’t hard questions to answer with simple statistics. They’re just rarely asked.


The tradeoff

Deterministic metrics can’t catch everything. They don’t measure whether dialogue is engaging, whether the vocabulary feels natural for the archetype, or whether the emotional texture is right. There are real limitations.

But for the specific properties we care about — causal coherence between cognition and behavior, persona consistency, belief realism — they’re more reliable than asking a language model, and they’re reproducible. The same dataset always produces the same scores. That makes them useful as baselines.

If you’re generating synthetic dialogue data and evaluating it with the same class of model that generated it, you’re probably producing data that’s subtly broken in ways you can’t see. We find this problem worth solving.


From static evaluation to real-time behavior

These metrics aren’t just for validating datasets. They come from a system that simulates how beliefs and relationships evolve turn by turn — deterministically, before language is generated.

Generate a synthetic human and start a conversation at stratasynth.com/demo/try. Instead of asking an LLM “was this coherent?”, you can see in real time how trust changes after each message, how defensiveness emerges, when conflict escalates or de-escalates.

LLMs judge what sounds right. This system shows what behaves consistently.


StrataSynth generates psychologically grounded synthetic dialogue datasets. The four HuggingFace datasets were each evaluated against these 12 metrics before publication.

stratasynth.com · HuggingFace datasets · pip install stratasynth-client