USING ADAPTIVE EXPERIMENTATION TO DESIGN REAL-WORLD AI SYSTEMS FOR EDUCATION & HEALTH
AI systems can fail to deliver societal benefit when they lack domain-specific knowledge about how humans behave. AI algorithms have the potential to help all students learn more or to reduce anyone's anxiety, but too often they help some while harming others. Helping all users requires incorporating knowledge from users and practitioners about which actions to experiment with – like which explanations make sense to each student, and which coaching messages help each individual who is stressed.
I develop methods to better generate these actions by combining knowledge from LLMs, users, practitioners, and scientists [AC.50, AC.7]. Helping everyone requires optimizing the correct metrics and having appropriate data to build models of diverse users, so that different people receive the actions that result in equitable outcomes [AC.48]. I have shown how Reinforcement Learning algorithms can better achieve this goal by drawing on Human-Computer Interaction and Social-Behavioural Science to design better metrics for behaviour.
My research program develops novel tools and methods for Adaptive (A/B/N) Experimentation. These compare the different actions an AI system can take, evaluate which are most effective, and deploy the best ones. These cycles of experimentation repeat perpetually, an engine for AI systems that never stop learning. I use Adaptive Experiments to embed LLMs and RL into the interfaces used every day by millions of people.
For example, we created an AI system that used text messages for digital health coaching – psychoeducation and prompts for managing stress, reducing depression, and encouraging exercise. It supported university students, the general population, and Spanish-speaking Latinos/Latinas in California who had depression and diabetes. Multi-armed bandit algorithms (RL) automatically experimented with diverse message types, modeled subgroup differences, and continually increased the probability of sending the most effective messages for different people. This helped detect when messages that benefited most users inadvertently harmed a statistical minority. This contextualized and personalized AI system helps everyone get the support they need to achieve better outcomes. These and other interventions we've developed have reached over 500,000 people in digital health and education, helping them learn more, reduce anxiety, and increase exercise.
I instantiated the tools and methods I developed into the AdaptEx framework. It won first place in the prestigious $1M Xprize for the Future of AI & Experimentation, by going beyond widely used approaches to A/B testing and randomized controlled trials. A $3M NSF Cyberinfrastructure grant was awarded to expand access to AdaptEx – as a testbed for AI researchers to work with domain experts and scientists to design and run online evaluations of different algorithms' benefits for people's behaviour. My research program unifies many areas, with publications spanning Applied Generative AI, Applied Reinforcement Learning, Statistics, Social-Behavioural Sciences (e.g. Psychology, Education, Health), and Human-Computer Interaction. I now highlight three future research trajectories for which my past work has provided a foundation:
(1) Tools for Adaptive Field Experimentation & Collective Intelligence, as applied to Education
(2) Adaptive Experimentation Methods for Personalization, as applied to Mental and Physical Health
(3) Accelerating Scientific Research With Statistically Reliable Algorithms for Adaptive Experimentation
ADAPTIVE FIELD EXPERIMENTATION TOOLS & COLLECTIVE INTELLIGENCE: APPLICATION TO EDUCATION
I aim to use Adaptive Experimentation tools and methods to transform ubiquitous yet static educational resources – so they become intelligent interfaces that constantly improve. Different versions of resources are generated through co-design between LLMs and humans – scientists, teachers, and students [AC.49]. These incorporate contextual knowledge from different stakeholders into alternative methods for teaching. To evaluate these, we design experiments that adapt AI to teachers' practical goals, are user-centered for students, and build on decades of research by social-behavioural scientists.
For example, suppose a student in Canvas struggles with a programming problem and finds an instructor’s digital explanation unhelpful. The AdaptEx framework turns this explanation into an intelligent reinforcement learning agent by embedding an Adaptive Experiment that can:
- Generate diverse alternative explanations using input from students, instructors, researchers, and LLMs.
- Evaluate them through randomized A/B experiments that measure engagement and learning.
- Adapt in real time using reinforcement learning to give better explanations sooner, and continuously add new ideas from LLMs and people.
Using HCI methods, we embedded a system into Canvas that applied Thompson Sampling to estimate the probability that explanation A works better than B, based on rigorous behavioural-science metrics. The system assigned students to explanations in proportion to that probability. An adaptive experiment can start at 50/50 and gradually shift to 60/40, then 80/20, converging toward 100/0 as evidence accumulates. Pre- to post-tests showed a 10% learning gain. Harvard instructors readily adopted the system because of its interpretability – we showed how each student's assignment probability depended on the data, and how the algorithm balanced exploration with helping students immediately. Scientists also valued how removing weaker explanations early let them add new versions to rapidly test hypotheses about learning. Using AdaptEx to help students reframe pre-exam stress, we identified explanations that raised average performance from a B to an A- after only ~4 minutes of reading [AC.35]. We ran 5 experiments with 1,200 students in two months – work that took past researchers two years.
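The probability-matching step described above can be sketched with a Beta-Bernoulli model. This is a minimal illustration, not the deployed AdaptEx implementation; the function names, uniform priors, and counts are my own assumptions:

```python
import random

def prob_a_better(successes_a, failures_a, successes_b, failures_b, draws=10_000):
    """Monte Carlo estimate of P(explanation A outperforms B), where each
    explanation's success rate has a Beta(1 + successes, 1 + failures) posterior."""
    wins = 0
    for _ in range(draws):
        sample_a = random.betavariate(1 + successes_a, 1 + failures_a)
        sample_b = random.betavariate(1 + successes_b, 1 + failures_b)
        if sample_a > sample_b:
            wins += 1
    return wins / draws

def assign_explanation(successes_a, failures_a, successes_b, failures_b):
    """Probability matching: show explanation A with probability P(A > B)."""
    p = prob_a_better(successes_a, failures_a, successes_b, failures_b)
    return "A" if random.random() < p else "B"
```

With no data the allocation starts near 50/50; as one explanation accumulates more successes, its share of assignments grows toward 100/0, matching the trajectory described above.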
Ongoing and Future work: The Xprize recognized AdaptEx as a foundation for the future of AI & experimentation in education, and a $3M NSF “software for scientists” grant now supports scaling it as a testbed for AI researchers to evaluate algorithms in education. This provides the foundation for extensive research: using adaptive experiments to automatically improve homework systems, peer learning, GenAI tutors, and other resources. There are many open questions about how to support diverse learners, including first-generation students, English-language learners, and students from different cultural backgrounds.
A major direction is End-User prompt engineering. Every message a teacher or student sends to ChatGPT effectively functions as Prompt A. Adaptive experimentation allows us to test alternative prompts (B or C) that incorporate user- or population-specific knowledge. This identifies which prompts help LLMs access the most relevant context and reveals how those insights can be embedded directly into future models. Our work shows how teachers and scientists can use this approach to co-design more effective prompts and tailor ChatGPT’s behaviour to students with different levels of prior knowledge and verbal fluency.
We also build interfaces that guide learners in prompting LLMs. For example, SPARK helps students generate effective self-talk when procrastinating by exposing “knobs” – sliders for tone, complexity, and scientific rationale – and showing how experts would revise the prompt. This enables students to obtain outputs that are more effective than default ChatGPT's. Our win in the DARPA Learning Tools Competition recognized how this work uses experimentation to shape the future of LLM-based learning tools for researchers and practitioners.
ADAPTIVE EXPERIMENTATION FOR PERSONALIZATION: APPLICATION TO MENTAL & PHYSICAL HEALTH BEHAVIOUR
I aim to generalize adaptive experimentation to personalization and contextualization across many domains, to ensure we build AI systems that serve many user subgroups. I present applications to behaviour change in mental and physical health, which provide a foundation for decades of work toward helping everyone. We use Adaptive Experimentation to turn SMS programs into reinforcement-learning agents that test dozens of coaching messages and learn which are most effective for different people. Instead of relying on biased or outdated off-policy data, we design prospective experiments that collect data aligned with the needs of contextual bandit algorithms, in populations experiencing stress, anxiety, and depression [J.17, AC.30].
Extending this to physical health, I collaborated with Adrian Aguilera (Berkeley Social Work) to design Spanish-language messages for Latino/Latina Californians with depression and diabetes, grounded in psychology and HCI principles, and personalized to individuals’ psychological states and situational challenges [J.31]. This reduces inequity by collecting high-quality contextual data from an underserved population and tailoring interventions to subgroups and individuals within that population. Our integration of scientists' and social workers' knowledge into contextual Thompson Sampling resulted in students and older adults with diabetes and depression substantially increasing their step count and engaging with clinical psychology content, improving their physical and mental health [J.31, J.20, J.18].
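The subgroup modeling described above can be sketched as a contextual Thompson sampler that keeps a separate Beta-Bernoulli posterior for each (subgroup, message) pair. This is a minimal sketch assuming binary engagement rewards and discrete subgroups; the class, subgroup, and message names are hypothetical, not from the deployed system:

```python
import random
from collections import defaultdict

class SubgroupThompsonSampler:
    """Minimal contextual bandit: one Beta-Bernoulli posterior per
    (subgroup, message) pair, so each subgroup converges toward the
    messages that work best for it."""

    def __init__(self, messages):
        self.messages = messages
        # counts[(subgroup, message)] = [engagements, non-engagements]
        self.counts = defaultdict(lambda: [0, 0])

    def choose(self, subgroup):
        # Sample a plausible engagement rate per message; send the argmax.
        samples = {}
        for m in self.messages:
            s, f = self.counts[(subgroup, m)]
            samples[m] = random.betavariate(1 + s, 1 + f)
        return max(samples, key=samples.get)

    def update(self, subgroup, message, engaged):
        self.counts[(subgroup, message)][0 if engaged else 1] += 1
```

Because posteriors are conditioned on subgroup, a message that dominates overall cannot mask a subgroup for whom a different message works better – the mechanism behind detecting when a popular message harms a statistical minority.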
A core contribution of this work is identifying which algorithms and parameter settings are most reliable for personalization. We demonstrate that many intuitive personalization techniques exploit low-signal data and fail in practice; our studies characterize the empirical conditions under which different algorithms are more or less likely to successfully personalize [AC.18, AC.22]. We used Adaptive Experiments to identify when a status quo message that benefited most users inadvertently harmed a statistical minority. We increased outcomes for underrepresented subgroups by 25%, while reducing the sample size needed to detect these effects and effectively personalize.
We are also providing new techniques to personalize LLM content and coaching to a user’s specific circumstances. We collected stories about the mental health challenges faced by first-generation students, single parents, and others underrepresented in LLM training data, and used them to design a prompt-engineering framework that adapts clinician-written narratives to the exact situation a user reports. These personalized LLM messages produced greater reductions in negative thoughts and better reflection on how to apply clinical principles to each user's own context.
This work opens a broad set of questions about developing better algorithms and models for contextualization and personalization across diverse contexts, populations, and individual needs. My ongoing and future work goes deeper into mental and physical health. It also generalizes these methods to belief and behaviour change in additional high-impact domains, including mitigating political polarization, supporting sustainable behaviours to reduce climate change, and addressing misinformation on social media.
ACCELERATING SCIENTIFIC RESEARCH WITH STATISTICALLY RELIABLE ALGORITHMS FOR ADAPTIVE EXPERIMENTATION
My real-world deployments have uncovered severe limitations in using Reinforcement Learning for scientific experiments – limitations so fundamental that they arise in both the most complex reward-maximizing algorithms and the simplest bandit algorithms. I have shown that reward-maximizing algorithms can inflate false positives from ~5% to ~13%, and false negatives from ~20% to ~34%. These issues are not resolved by the algorithms commonly assumed to be statistically sound in the machine learning literature. “Proven” guarantees too often rest on formulations that became widespread by historical accident or for convenience in mathematical proofs – assumptions that break down in societally important domains, where human data are sparse, noisy, and context-dependent.
To address this gap, I have initiated a research program on Statistically Reliable Algorithms for Adaptive Experimentation. These provide the data quality that scientists need to draw rigorous and trustworthy conclusions from domains like education, health, and other societally relevant problems that concern human behaviour. We have developed algorithms that are both more interpretable and more interactive, because experimenters can encode their contextual knowledge directly into relevant parameters.
One underexplored class of these methods extends Thompson Sampling using Adaptive Epsilon-Thompson Sampling [WP.9], where epsilon is the probability of running a traditional uniform random experiment and (1 – epsilon) is the probability of using Thompson Sampling. Our novel approach, TS-Posterior Difference, lets experimenters specify what they consider a “small difference” between arms. TS-Posterior Difference then sets epsilon to the posterior probability that the difference between arms falls below that threshold. This increases traditional experimentation precisely when intervention differences are small, reducing false positives and false negatives, while still maximizing reward when differences are larger. TS-Posterior Difference outperforms the state of the art on the full tradeoff between false positives, false negatives, and reward.
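The mechanism described above can be sketched for two Beta-Bernoulli arms, with the posterior probability of a small difference estimated by Monte Carlo. This is a minimal illustration of the idea as stated here, with assumed uniform priors and function names of my own; it may differ from the algorithm in [WP.9]:

```python
import random

def posterior_small_diff(a_counts, b_counts, threshold, draws=10_000):
    """Monte Carlo estimate of the posterior probability that the two arms'
    success rates differ by less than `threshold`. Counts are (successes, failures)."""
    small = 0
    for _ in range(draws):
        pa = random.betavariate(1 + a_counts[0], 1 + a_counts[1])
        pb = random.betavariate(1 + b_counts[0], 1 + b_counts[1])
        if abs(pa - pb) < threshold:
            small += 1
    return small / draws

def ts_posterior_difference(a_counts, b_counts, threshold):
    """With probability epsilon = P(|p_A - p_B| < threshold), run a traditional
    uniform-random assignment; otherwise take a Thompson Sampling step."""
    epsilon = posterior_small_diff(a_counts, b_counts, threshold)
    if random.random() < epsilon:
        return random.choice(["A", "B"])  # uniform random experiment
    pa = random.betavariate(1 + a_counts[0], 1 + a_counts[1])
    pb = random.betavariate(1 + b_counts[0], 1 + b_counts[1])
    return "A" if pa > pb else "B"  # Thompson Sampling step
```

When the arms are hard to distinguish, epsilon is high and the system behaves like a traditional randomized experiment, protecting statistical power; when one arm is clearly better, epsilon shrinks and reward maximization takes over.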
To move beyond regret as the primary evaluation metric for bandit algorithms, we have developed an objective function that captures these tradeoffs in a way that aligns directly with the needs of practitioners and scientists in education and health [WP.10]. Our algorithm increased participant outcomes by 17%, and reduced the false negative rate by 4%. This is a substantial improvement for AI systems designed to help scientists test more ideas, more quickly, in high stakes real-world contexts.
These results point to many open questions in developing Statistically Reliable Algorithms for accelerating experimentation. This is a core challenge in the behavioural and social sciences, where reliable adaptive experimentation is needed for AI systems that aim to benefit society by supporting behaviour change across domains such as education and health. It is also a challenge in using AI in the natural sciences: we received a grant from the $200M Acceleration Consortium to evaluate and extend our algorithms in chemistry and biology.
CONCLUSION
My research investigates how adaptive experimentation can transform everyday interfaces – learning platforms, health apps, and messaging systems – into AI systems that constantly improve. By integrating Reinforcement Learning, LLMs, and insights from social-behavioural science and HCI, I design AI systems that learn to impact human beliefs and behaviour in real-world domains: education, health, misinformation, polarization, and climate awareness. These systems discover how to personalize to help everyone act in ways that benefit them and society.
We gratefully acknowledge support from our collaborators: