Your AI Evaluations Are Broken. Here's Why You Need Fusion Sentinel. Stop Missing What Actually Matters.
RAND found 5 systematic failures in AI evaluation. Model drift, invalid baselines, contamination. Fusion Sentinel provides continuous monitoring that point-in-time audits miss.
RAND interviewed 16 experts who run human uplift studies measuring AI's impact on human performance. These are the evaluations companies use to decide whether to deploy AI systems. The evaluations regulators rely on to assess risk. The studies that determine whether your organization adopts tools that could save or cost millions.
Every single expert reported the same problem: the AI they're evaluating changes mid-study.
Not hypothetically.
Not occasionally.
Systematically.
One expert started a three-month cybersecurity study testing whether AI helps attackers breach systems.
Month one: the model could execute code and install Python tools in its environment.
Month three: same model, constant refusals.
The model had updated. Nobody flagged it. The study compared two different systems while reporting results for one.
That's not an evaluation. That's a guess wrapped in statistics.
Here's why this matters. ChatGPT Health launched January 7, 2026 with 40 million daily users. Mount Sinai published results on February 23, 2026 showing a 52% emergency miss rate. That's roughly 45 days of deployment, 1.8 billion health queries, and 900 million potential failures before anyone knew the system was broken.
The evaluation that could have caught this? It would have been invalid before publication. Because the model you test today isn't the model users get tomorrow.
RAND's research maps five systematic failures in how organizations evaluate AI. Every pattern creates liability. Every gap costs money. Every assumption that breaks down puts your deployment at risk.
Here's what you need to know.
Pattern 1: Intervention Fidelity Collapses Under Rapid Model Evolution
Expert N described the cybersecurity study: "When we started designing this experiment, the publicly available models were capable of running code and installing different Python tools. We are starting now to see a lot more refusals from the same exact model. The model has undergone an update, and we no longer have access to that snapshot of the previous instance. If you run a study over three months in which that model is being updated and you're unaware, you're comparing apples and oranges."
This isn't an edge-case failure. It's a systematic breakdown of the core assumption underlying randomized controlled trials: that the intervention remains constant.
In drug trials, the chemical composition of the pill doesn't change between Week 1 and Week 12. In AI evaluations, the model, system prompts, safety filters, plugins, and auxiliary tools can all shift. Often without notice. Sometimes multiple times during a single study.
When participants are exposed to different model versions at the same time, you've introduced unbalanced heterogeneity in the intervention. Your internal validity is gone. You're no longer measuring the causal effect of AI access. You're measuring the combined effect of Model Version A for half your sample and Model Version B for the other half.
When the intervention changes uniformly over time due to a globally deployed update, internal validity may survive but you're not estimating the effect of a model anymore. You're estimating the effect of exposure to a changing system over a specified time horizon.
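The two cases above (versions that differ across participants at the same time, versus a uniform change over time) can be told apart from a session log. The following is a minimal sketch, assuming you can record some version identifier for every session; the `Session` record, its `model_version` field, and the day-level granularity are all hypothetical choices for illustration, not something the RAND study or any provider prescribes.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Session:
    participant_id: str
    day: date
    model_version: str  # hypothetical: whatever identifier you are able to log


def classify_fidelity(sessions: list[Session]) -> str:
    """Classify how the intervention varied across a study's sessions.

    "constant"  - one model version throughout (or no sessions at all)
    "uniform"   - versions changed, but never two versions on the same day
    "staggered" - at least one day on which participants saw different versions
    """
    versions_by_day: dict[date, set[str]] = defaultdict(set)
    for s in sessions:
        versions_by_day[s.day].add(s.model_version)

    all_versions = set().union(*versions_by_day.values())
    if len(all_versions) <= 1:
        return "constant"
    if any(len(versions) > 1 for versions in versions_by_day.values()):
        return "staggered"
    return "uniform"


# Hypothetical log: everyone moves from v1 to v2 between January and March.
sessions = [
    Session("p1", date(2026, 1, 5), "v1"),
    Session("p2", date(2026, 1, 5), "v1"),
    Session("p1", date(2026, 3, 2), "v2"),
    Session("p2", date(2026, 3, 2), "v2"),
]
print(classify_fidelity(sessions))
```

A "staggered" result points to the internal validity problem; a "uniform" result means the estimate describes exposure to a changing system, not a single model. Because the sketch buckets by day, a global update that lands mid-day will show up as "staggered" for that day, so a finer time bucket may be needed in practice.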
The legal exposure: you deploy based on evaluation results that measured a system that no longer exists. When harm occurs, your evaluation doesn't defend you. It exposes you.
Pattern 2: Control Conditions Fail in AI-Integrated Environments
Expert P noted: "In any control setting, there's going to be some technology available, whether it be AI, or it used to be called AI but now it's no longer AI, or maybe it's some other AI tool. It's always relative to something. Think about the coding papers that study how Copilot impacts programmers. Well, before Copilot, there was Autocomplete and TabComplete and IDEs."
Unlike traditional interventions, LLMs don't replace a single prior tool. They layer onto an ecosystem of existing technologies. This creates two problems.
First, there's no obvious counterfactual. What's the right comparison? Internet search? Older AI tools? Human experts? The choice determines what uplift means. A study comparing Claude Opus 4 to internet search measures something fundamentally different from a study comparing Claude Opus 4 to GPT-4.
Second, control definitions vary wildly across studies. Some restrict participants to basic internet search. Others provide access to human experts or alternative software. There's no consensus. No standard. Every study defines its own baseline.
The result? You can't compare uplift estimates across studies longitudinally. You can't track whether AI is getting more or less helpful over time. And you can't benchmark your deployment against industry standards, because there aren't any.
The business impact: say you commission an evaluation showing 40% productivity uplift. Then your competitor commissions one showing 60%. Theirs looks better, but it turns out they used a weaker baseline. Now your board is asking how you got beaten and why you're behind. Are you really behind? Probably not, but you can't prove it.
Pattern 3: Contamination Trivializes Experimental Control
Expert A observed: "I expect cheating to be much more salient in LLM uplift studies, especially if there is an internet-only control group. Contrast this with a clinical drug trial, where, if you're not giving the control group the drugs, they're probably not going to be able to acquire it."
In drug trials, control groups can't easily access the experimental treatment. In AI evaluations, control group participants can just open ChatGPT in another browser tab. Contamination occurs when control group participants directly access restricted AI tools. At that point, you no longer have a control group. Spillovers are subtler: exposure diffuses indirectly through social interaction, shared strategies, or collaboration. Both violate the non-interference assumptions required for causal inference.
Risks escalate in settings with close cohort structures. Classrooms, labs, workplaces, training programs. Participants naturally exchange information. One person in the treatment group shows a control group colleague how Claude solves a problem. Your experimental design is now compromised.
The trade-off: studies designed to improve external validity by approximating real-world, longer-term use face heightened contamination risks. Short, tightly controlled studies preserve causal identification but fail to reflect how AI actually gets used.
The operational impact: you run a six-month productivity study. Control group performance improves 30%. Not because they suddenly got better at their jobs, but because they were using ChatGPT the whole time and you didn't catch it. Now your uplift estimate is meaningless.
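To see how contamination shrinks a measured effect, here is a minimal sketch under deliberately simple assumptions: every treated participant and every contaminated control gains exactly the same amount, and uncontaminated controls gain nothing. The function name and the 0.40 figure are hypothetical, chosen only for illustration.

```python
def observed_uplift(true_uplift: float, contamination_rate: float) -> float:
    """Difference in group mean gains when a share of controls quietly uses the tool.

    Assumes every treated participant and every contaminated control gains
    exactly `true_uplift`, and uncontaminated controls gain nothing.
    """
    if not 0.0 <= contamination_rate <= 1.0:
        raise ValueError("contamination_rate must be between 0 and 1")
    treatment_mean_gain = true_uplift
    control_mean_gain = contamination_rate * true_uplift
    return treatment_mean_gain - control_mean_gain


# Hypothetical true uplift of 0.40 at increasing levels of contamination.
for rate in (0.0, 0.25, 0.5, 1.0):
    print(f"contamination {rate:.0%}: observed uplift {observed_uplift(0.40, rate):.2f}")
```

Under these assumptions the observed effect is the true effect scaled by the uncontaminated share of the control group, so it falls to zero when every control participant has access. Real studies will not be this tidy, but the direction of the bias is the same.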
Pattern 4: AI Literacy Varies and Evolves Faster Than Your Study
Expert L explained: "If you're selecting someone who is a complete novice, has never used GenAI, I think you will not see much uplift just because they don't know how to use the tool. Give the same person six months to learn about GenAI and ask them the same question six months down the road, they might be successful."
AI literacy determines performance. How YOU prompt matters. How YOU interpret outputs matters. How YOU integrate the tool into your workflow matters. Variation in these skills creates two distinct validity threats.
The first is external validity. If participant skill doesn't reflect the population of interest, uplift estimates won't generalize. You recruit students who've never used ChatGPT. Your results don't predict what happens when you give Claude to lawyers who've been prompt engineering for two years.
The second is internal validity. If AI literacy varies within your study and isn't controlled for, it acts as a confounder. High-skill participants in treatment show uplift. Low-skill participants show none. You report average uplift of 20%. The number isn't technically wrong, but it describes neither group, so it misleads.
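A minimal sketch of how a pooled average can hide that split, using hypothetical task scores (none of these numbers come from the RAND study) and an assumed two-level literacy label:

```python
from statistics import mean

# Hypothetical task scores on a 0-100 scale: (arm, ai_literacy, score)
records = [
    ("control", "high", 50), ("control", "high", 50),
    ("control", "low", 50), ("control", "low", 50),
    ("treatment", "high", 70), ("treatment", "high", 70),
    ("treatment", "low", 50), ("treatment", "low", 50),
]


def relative_uplift(rows):
    """Relative gain of the treatment mean over the control mean."""
    control = mean(score for arm, _, score in rows if arm == "control")
    treated = mean(score for arm, _, score in rows if arm == "treatment")
    return (treated - control) / control


pooled = relative_uplift(records)
by_skill = {
    level: relative_uplift([r for r in records if r[1] == level])
    for level in ("high", "low")
}

print(f"pooled uplift: {pooled:.0%}")
for level, uplift in by_skill.items():
    print(f"{level}-literacy uplift: {uplift:.0%}")
```

In this toy data the pooled figure is 20%, while the high-literacy stratum shows 40% and the low-literacy stratum shows none. Reporting by stratum is one simple way to keep the average from standing in for groups it doesn't describe.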
Worst case, AI literacy evolves during your study. Participants learn. They share strategies. They discover better prompts. By Week 12, they're using the tool differently than in Week 1. Your intervention isn't just the AI. It's the AI plus accumulated learning.
Expert D framed this as a problem of interpreting results over time: "If you run another study with the same group of people six months later, the world has probably changed in pretty meaningful ways. People are more familiar with using LLMs. That's going to change the way that they perform."
The strategic risk: your evaluation shows modest uplift. You decide not to deploy. Your competitor's evaluation shows strong uplift. They deploy. Turns out the difference was participant training, not model capability. You just lost market position based on a confounded study.
Pattern 5: Results Become Obsolete Before Publication
Expert O explained why cross-model comparisons are infeasible: "If you ran a study with this model and then reviewers are like, well, why don't you try model X or Y? It's not a static benchmark that you can just rerun. You'd have to recruit a new group of participants, which is not necessarily realistic."
Unlike computational benchmarks, human uplift studies can't be rerun on updated models without substantial recruitment and execution costs. Direct comparison across model versions is often infeasible in practice.
Meanwhile, both AI systems and patterns of human use evolve rapidly. Model capabilities change. Deployment contexts shift. Participant AI literacy increases. What results mean over time becomes unclear.
Interpretive difficulties compound when baselines shift. As open-source models improve and AI integrates into everyday tools, detecting incremental gains becomes harder. Relying on outdated baselines risks comparisons that no longer reflect realistic conditions.
Expert D called this the "boiling frog" problem. Gradual changes in reference points obscure substantial shifts in absolute capability. You run a study in January showing 30% uplift over GPT-3.5. By June, GPT-3.5 is irrelevant. Your baseline no longer represents what users actually have access to.
Worse, many interpretive challenges originate in earlier design choices. How constructs were operationalized. How controls were defined. How populations were recruited. Even rigorously executed studies become difficult to interpret when the context they measured no longer exists.
The governance problem: uplift results are often interpreted as speaking to future, scaled, or system-level impacts, such as post-deployment performance or long-term risk, even when the studies were designed to estimate effects at a single point in time under specific conditions. Without careful qualification, extrapolations overstate what the evidence supports.
What Regulations Actually Require
- The EU AI Act mandates continuous monitoring for high-risk systems. Not one-time evaluations. Continuous monitoring. If your evaluation measured a model three months ago and the model has updated since, you're not in compliance.
- ISO 42001 requires ongoing validation of AI management systems. That includes monitoring for behavioral drift. When your model's outputs start differing from testing, you need to know before users are affected.
- The UK AI Safety Institute's approach to evaluations explicitly acknowledges these challenges. Its February 2024 framework emphasizes that evaluations must account for rapid model evolution, shifting baselines, and real-world deployment contexts.
- NIST AI 800-1 on Managing Misuse Risk for Dual-Use Foundation Models notes that single-point evaluations cannot capture how capabilities emerge or how users adapt over time. The guidance explicitly calls for monitoring approaches that track system behavior across deployment.
The legal standard is forming while you're making deployment decisions. If your system fails after an evaluation that measured a different version, your evaluation doesn't protect you. It proves you deployed without current knowledge of system behavior.
What Continuous Monitoring Catches
Single-point evaluations measure AI at a moment in time. Continuous monitoring tracks how systems behave across deployment.
Intervention fidelity. When a model updates mid-deployment, continuous monitoring flags the change. You know immediately whether the system users are interacting with matches the system you evaluated.
Behavioral drift. Any output that starts diverging from expected patterns is drift. You need to catch it in real time. Not six months later when evaluation results publish. Not after users report problems. While you can still intervene.
User adaptation. As AI literacy increases, usage patterns change. Continuous monitoring shows you how performance evolves, so you can see whether initial uplift estimates still hold as users get better at prompting.
Baseline shifts. Reference points change when competing tools improve or stop being available. Monitoring keeps comparisons relevant, so you're not anchored to an outdated baseline that no longer represents realistic conditions.
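As one concrete illustration of catching behavioral drift, the sketch below compares the refusal rate in a recent window of responses against a baseline window using a standard two-proportion z-statistic. This is a generic statistical check offered under stated assumptions, not a description of how Fusion Sentinel works; the function name, the counts, and the alert threshold are all hypothetical.

```python
from math import sqrt


def refusal_drift_z(baseline_refusals: int, baseline_total: int,
                    current_refusals: int, current_total: int) -> float:
    """Two-proportion z-statistic for a change in refusal rate.

    Positive values mean the current window refuses more often than baseline.
    """
    if baseline_total <= 0 or current_total <= 0:
        raise ValueError("both windows need at least one response")
    p_base = baseline_refusals / baseline_total
    p_curr = current_refusals / current_total
    pooled = (baseline_refusals + current_refusals) / (baseline_total + current_total)
    std_err = sqrt(pooled * (1 - pooled) * (1 / baseline_total + 1 / current_total))
    if std_err == 0:
        return 0.0  # both windows are all refusals or all non-refusals
    return (p_curr - p_base) / std_err


ALERT_THRESHOLD = 3.0  # hypothetical; tune to your tolerance for false alarms

# Hypothetical counts: 2% refusals at evaluation time, 20% in the latest window.
z = refusal_drift_z(baseline_refusals=20, baseline_total=1000,
                    current_refusals=200, current_total=1000)
if abs(z) > ALERT_THRESHOLD:
    print(f"possible drift: z = {z:.1f}")
```

Refusal rate is only one signal. The same comparison could be run on any behavior you can count per response, such as tool-use attempts or task completions, which is closer to what the cybersecurity example above would have needed.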
We built Fusion Sentinel because single-point evaluations can't keep pace with how fast AI systems change. By the time traditional evaluation results publish, they're measuring a system that doesn't exist anymore. Sentinel provides real-time observability across deployment. Not just at launch. Not just during periodic reviews. Continuously.
What You Do This Week
- Stop relying on point-in-time evaluations to make deployment decisions about systems that update weekly. Start monitoring continuously.
- Demand intervention fidelity guarantees from providers. If you're evaluating Model Version X, you need assurance that Version X is what users will get. Not Version X.1 that deployed during your study.
- Standardize your baseline definitions. Document what control condition you're comparing against. Make it explicit. Make it defensible. Make it consistent across evaluations so you can track performance over time.
- Account for AI literacy in your analysis. Measure participant skill. Control for it. Report results by skill level. Don't let variation in user proficiency confound your causal estimates.
- Treat evaluation results as time-bound estimates, not eternal truths. The uplift you measured three months ago doesn't tell you what's happening today unless you're monitoring continuously.
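Several of the steps above (pinning the evaluated version, documenting the baseline, and treating results as time-bound) can be captured in a single record that is checked before a result is relied on. This is a minimal sketch; the `EvaluationRecord` fields and the 90-day default are hypothetical choices for illustration, not a regulatory threshold.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class EvaluationRecord:
    model_version: str   # the version that was actually evaluated
    baseline: str        # the documented control condition
    evaluated_on: date


def staleness_reasons(record: EvaluationRecord, deployed_version: str,
                      today: date, max_age_days: int = 90) -> list[str]:
    """Return the reasons an evaluation should no longer be relied on (empty if none)."""
    reasons = []
    if deployed_version != record.model_version:
        reasons.append(
            f"deployed version {deployed_version!r} differs from "
            f"evaluated version {record.model_version!r}"
        )
    age_days = (today - record.evaluated_on).days
    if age_days > max_age_days:
        reasons.append(f"evaluation is {age_days} days old (limit {max_age_days})")
    return reasons


# Hypothetical record: Version X evaluated against an internet-search-only control.
record = EvaluationRecord("X", "internet search only", date(2026, 1, 5))
for reason in staleness_reasons(record, deployed_version="X.1", today=date(2026, 2, 1)):
    print(reason)
```

Keeping the baseline in the same record means two evaluations can only be compared when their control conditions match, which is the consistency the baseline step asks for.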
RAND's research makes it clear: methodological challenges in AI evaluation aren't edge cases. They're systematic, they threaten every form of validity, they make interpretation unreliable, and they're only getting worse as model updates accelerate, AI literacy increases, and deployment contexts evolve.
The Offer
We’re offering 30-minute consultations for organizations serious about valid AI evaluation. We'll review your evaluation approach. Identify where single-point studies leave you exposed. Show you exactly what continuous monitoring catches that periodic evaluations miss.
No sales pitch. No jargon. No BS. Just a direct conversation about where your evaluations break down and what you can do about it.
Email: info@fusioncollective.net
Subject line: "Evaluation Infrastructure"
We read every one.
The Choice
RAND documented every failure mode in current AI evaluation practices. Intervention fidelity collapses. Control conditions fail. Contamination trivializes experimental design. AI literacy confounds results. Findings become obsolete before publication.
Every pattern is documented. Every threat to validity is known. Every solution exists.
The question is whether you implement continuous monitoring before your deployment decisions rest on evaluations of systems that no longer exist, or after.
Why do AI evaluations become invalid before they are published?
AI evaluations become invalid before publication because the models they measure update continuously during and after the study period, meaning the system documented in the results no longer reflects the system users are actually interacting with. RAND research involving 16 AI evaluation experts confirmed this as a systematic failure, not an edge case. One cybersecurity study found that a model capable of executing code and installing Python tools in month one was issuing constant refusals by month three, despite being marketed as the same model. The study had unknowingly compared two different systems while reporting results for one. The structural problem is that AI evaluations borrow methodology from clinical drug trials, where the pill does not change composition between Week 1 and Week 12. In AI deployment, the model, system prompts, safety filters, and auxiliary tools can all shift, often multiple times, without notice to the evaluating organization. This is why continuous monitoring exists as a category. Fusion Sentinel tracks model behavior in real time across the full deployment lifecycle, flagging the moment a system drifts from the version that was evaluated and approved.
What is AI model drift and why does it matter for compliance?
AI model drift is the gradual or sudden divergence of an AI system's outputs and behaviors from the baseline that was originally tested, approved, and documented for deployment. Drift matters for compliance because regulatory frameworks including the EU AI Act, ISO 42001, and NIST AI 800-1 do not recognize point-in-time evaluations as sufficient validation for high-risk AI systems. They require continuous monitoring. When a model drifts post-deployment and the organization has no detection mechanism, any harm that results is traced back to a compliance gap, not a technical accident. Drift takes several forms in practice. Behavioral drift occurs when response patterns shift without a formal model update. Intervention fidelity collapse occurs when the model version users interact with no longer matches the version evaluated for deployment. Baseline drift occurs when the external environment changes so significantly that historical comparisons lose meaning. Organizations running periodic evaluations only discover drift after users have already been exposed to degraded or altered system behavior. Fusion Sentinel provides real-time AI observability that detects drift as it emerges, before it becomes a regulatory or legal exposure.
What does the EU AI Act require for continuous AI monitoring?
The EU AI Act requires that high-risk AI systems undergo continuous post-market monitoring throughout their operational lifetime, not a single pre-deployment evaluation followed by passive deployment. Under the Act, providers and deployers of high-risk AI systems must establish and document monitoring systems capable of detecting performance changes, behavioral drift, and deviations from approved system behavior. This obligation extends beyond initial conformity assessments. It applies to the live system as users interact with it, which means any model update that alters behavior triggers a re-evaluation obligation. The practical implication is significant. If your AI vendor updates the underlying model and your organization has no mechanism to detect that the system has changed, you are out of compliance from the moment that update deploys. The evaluation you commissioned before launch does not satisfy the Act's ongoing monitoring requirement. ISO 42001, the international standard for AI management systems, reinforces this. It requires documented procedures for ongoing validation, including explicit monitoring for behavioral drift. Organizations that treat evaluation as a one-time event rather than a continuous function face both regulatory exposure and the legal liability that follows when a drifted system causes harm. Fusion Sentinel is designed to fulfill this continuous monitoring obligation directly, providing the documented behavioral tracking that regulators are increasingly requiring as a baseline for compliant AI deployment.
What is the difference between AI evaluation and continuous AI monitoring?
AI evaluation is a point-in-time measurement of how a specific AI system performs under specific conditions on a specific date. Continuous AI monitoring is the ongoing, real-time tracking of how that same system behaves across its entire deployment lifecycle as conditions, users, and the model itself evolve. The distinction is not semantic. It is the difference between a snapshot and a surveillance system. An evaluation tells you what was true when you looked. Monitoring tells you what is true right now. RAND research documented five systematic failure patterns in current AI evaluation practice, all rooted in the same structural flaw: evaluations assume the intervention stays constant. AI systems do not. Models update. User behavior adapts. Baselines shift. Contamination occurs. By the time evaluation results publish, the window they measured has already closed. Continuous monitoring addresses each failure pattern directly. It tracks intervention fidelity so you know the moment a model update changes what users are experiencing. It catches behavioral drift before it compounds into documented harm. It tracks user adaptation patterns so uplift estimates remain current. And it keeps baseline comparisons anchored to live conditions, not obsolete reference points. For organizations subject to EU AI Act, ISO 42001, or NIST AI RMF requirements, continuous monitoring is not optional infrastructure. It is the compliance mechanism regulators are designing their frameworks around.
How do silent AI model updates create legal liability for organizations?
Silent AI model updates create legal liability because they produce a gap between the system an organization evaluated and approved for deployment and the system users are actually interacting with, often without any documentation, disclosure, or internal awareness that the gap exists. When harm occurs after a silent update, the organization's evaluation record does not protect it. It proves the organization deployed without current knowledge of system behavior. That is the opposite of a legal defense. The ChatGPT Health case demonstrates the scale of this risk. The system launched January 7, 2026. Mount Sinai published findings on February 23, 2026 showing a 52% emergency symptom miss rate. That is 45 days of deployment, an estimated 1.8 billion health queries, and 900 million potential failures before any external monitoring caught what internal evaluation had missed. The legal standard forming around AI deployment does not require proof of malicious intent. It requires proof of due diligence. Due diligence means knowing the current behavior of the system you are operating, not the behavior of the version you tested three months ago. Organizations that cannot demonstrate continuous behavioral monitoring are increasingly exposed to negligence claims when their deployed system causes harm, regardless of how rigorous their initial evaluation was. Continuous monitoring is the documented due diligence record that separates organizations that can defend their deployment decisions from those that cannot.