USING ADAPTIVE EXPERIMENTATION TO DESIGN REAL-WORLD AI SYSTEMS FOR EDUCATION & HEALTH
AI systems can fail to deliver societal benefit when they lack domain-specific knowledge about how humans behave. AI algorithms have the potential to help all students learn more or to reduce anyone's anxiety, but too often they help some while harming others. Helping all users requires incorporating knowledge from users and practitioners about which actions to experiment with – like which explanations make sense to each student, and which coaching messages help each individual who is stressed.
I develop methods to better generate these actions by combining knowledge from LLMs, users, practitioners, and scientists [AC.50, AC.7]. Helping everyone requires optimizing the correct metrics and having appropriate data to build models of diverse users, so that different people receive the actions that result in equitable outcomes [AC.48]. I have shown how Reinforcement Learning algorithms can better achieve this goal by drawing on Human-Computer Interaction and Social-Behavioural Science to design better metrics for behaviour.
My research program develops novel tools and methods for Adaptive (A/B/N) Experimentation. These compare the different actions an AI system can take, evaluate which are most effective, and deploy the best ones. These cycles of experimentation repeat perpetually, an engine for AI systems that never stop learning. I use Adaptive Experiments to embed LLMs and RL into the interfaces used every day by millions of people.
For example, we created an AI system that used text messages for digital health coaching – psychoeducation and prompts for managing stress, reducing depression, and encouraging exercise. It supported university students, the general population, and Spanish-speaking Latinos/Latinas in California who had depression and diabetes. Multi-armed bandit algorithms (RL) automatically experimented with diverse message types, modeled subgroup differences, and continually increased the probability of sending the most effective messages for different people. This helped detect when messages that benefited most users inadvertently harmed a statistical minority. This contextualized and personalized AI system helps everyone get the support they need to achieve better outcomes. These and other interventions we've developed have reached over 500,000 people in digital health and education, helping them learn more, reduce anxiety, and increase exercise.
I instantiated the tools and methods I developed into the AdaptEx framework. It won first place in the prestigious $1M Xprize for the Future of AI & Experimentation, by going beyond widely used approaches to A/B testing and randomized controlled trials. A $3M NSF Cyberinfrastructure grant was awarded to expand access to AdaptEx – as a testbed for AI researchers to work with domain experts and scientists to design and run online evaluations of different algorithms' benefits for people's behaviour. My research program unifies many areas, with publications spanning Applied Generative AI, Applied Reinforcement Learning, Statistics, Social-Behavioural Sciences (e.g. Psychology, Education, Health), and Human-Computer Interaction. I now highlight three future research trajectories for which my past work has provided a foundation:
(1) Tools for Adaptive Field Experimentation & Collective Intelligence, as applied to Education
(2) Adaptive Experimentation Methods for Personalization, as applied to Mental and Physical Health
(3) Accelerating Scientific Research With Statistically Reliable Algorithms for Adaptive Experimentation
ADAPTIVE FIELD EXPERIMENTATION TOOLS & COLLECTIVE INTELLIGENCE: APPLICATION TO EDUCATION
I aim to use Adaptive Experimentation tools and methods to transform ubiquitous yet static educational resources – so they become intelligent interfaces that constantly improve. Different versions of resources are generated through co-design between LLMs and humans – scientists, teachers, and students [AC.49]. These incorporate contextual knowledge from different stakeholders into alternative methods for teaching. To evaluate these, we design experiments that adapt AI to teachers' practical goals, are user-centered for students, and build on decades of research by social-behavioural scientists.
For example, suppose a student in Canvas struggles with a programming problem and finds an instructor’s digital explanation unhelpful. The AdaptEx framework turns this explanation into an intelligent reinforcement learning agent by embedding an Adaptive Experiment that can:
- Generate diverse alternative explanations using input from students, instructors, researchers, and LLMs.
- Evaluate them through randomized A/B experiments that measure engagement and learning.
- Adapt in real time using reinforcement learning to give better explanations sooner, and continuously add new ideas from LLMs and people.
Using HCI methods, we embedded a system into Canvas that applied Thompson Sampling to estimate the probability that explanation A works better than B, based on rigorous behavioural-science metrics. The system assigned students to explanations in proportion to that probability. An adaptive experiment can start at 50/50 and gradually shift to 60/40, then 80/20, converging toward 100/0 as evidence accumulates. Pre- to post-tests showed a 10% learning gain. Harvard instructors readily adopted the system because of its interpretability – we showed how each student's assignment probability depended on the data, and how the algorithm balanced exploration with helping students immediately. Scientists also valued how removing weaker explanations early let them add new versions to rapidly test hypotheses about learning. Using AdaptEx to help students reframe pre-exam stress, we identified explanations that raised average performance from a B to an A- after only ~4 minutes of reading [AC.35]. We ran 5 experiments with 1,200 students in two months – work that took past researchers two years.
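The probability-matching step described above can be sketched with a Beta-Bernoulli model. This is a minimal illustration, not the deployed AdaptEx implementation; the function names, uniform priors, and counts are my own assumptions:

```python
import random

def prob_a_better(successes_a, failures_a, successes_b, failures_b, draws=10_000):
    """Monte Carlo estimate of P(explanation A outperforms B), where each
    explanation's success rate has a Beta(1 + successes, 1 + failures) posterior."""
    wins = 0
    for _ in range(draws):
        sample_a = random.betavariate(1 + successes_a, 1 + failures_a)
        sample_b = random.betavariate(1 + successes_b, 1 + failures_b)
        if sample_a > sample_b:
            wins += 1
    return wins / draws

def assign_explanation(successes_a, failures_a, successes_b, failures_b):
    """Probability matching: show explanation A with probability P(A > B)."""
    p = prob_a_better(successes_a, failures_a, successes_b, failures_b)
    return "A" if random.random() < p else "B"
```

With no data the allocation starts near 50/50; as one explanation accumulates more successes, its share of assignments grows toward 100/0, matching the trajectory described above.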
Ongoing and Future work: The Xprize recognized AdaptEx as a foundation for the future of AI & experimentation in education, and a $3M NSF “software for scientists” grant now supports scaling it as a testbed for AI researchers to evaluate algorithms in education. This provides the foundation for extensive research: using adaptive experiments to automatically improve homework systems, peer learning, GenAI tutors, and other resources. There are many open questions about how to support diverse learners, including first-generation students, English-language learners, and students from different cultural backgrounds.
A major direction is End-User prompt engineering. Every message a teacher or student sends to ChatGPT effectively functions as Prompt A. Adaptive experimentation allows us to test alternative prompts (B or C) that incorporate user- or population-specific knowledge. This identifies which prompts help LLMs access the most relevant context and reveals how those insights can be embedded directly into future models. Our work shows how teachers and scientists can use this approach to co-design more effective prompts and tailor ChatGPT’s behaviour to students with different levels of prior knowledge and verbal fluency.
We also build interfaces that guide learners in prompting LLMs. For example, SPARK helps students generate effective self-talk when procrastinating by exposing “knobs” – sliders for tone, complexity, and scientific rationale – and showing how experts would revise the prompt. This enables students to obtain outputs that are more effective than default ChatGPT's. Our win in the DARPA Learning Tools Competition recognized how this work uses experimentation to shape the future of LLM-based learning tools for researchers and practitioners.
ADAPTIVE EXPERIMENTATION FOR PERSONALIZATION: APPLICATION TO MENTAL & PHYSICAL HEALTH BEHAVIOUR
I aim to generalize adaptive experimentation to personalization and contextualization across many domains, to ensure we build AI systems that serve many user subgroups. I present applications to behaviour change in mental and physical health, which provide a foundation for decades of work toward helping everyone. We use Adaptive Experimentation to turn SMS programs into reinforcement-learning agents that test dozens of coaching messages and learn which are most effective for different people. Instead of relying on biased or outdated off-policy data, we design prospective experiments that collect data aligned with the needs of contextual bandit algorithms, in populations experiencing stress, anxiety, and depression [J.17, AC.30].
Extending this to physical health, I collaborated with Adrian Aguilera (Berkeley Social Work) to design Spanish-language messages for Latino/Latina Californians with depression and diabetes, grounded in psychology and HCI principles, and personalized to individuals’ psychological states and situational challenges [J.31]. This reduces inequity by collecting high-quality contextual data from an underserved population and tailoring interventions to subgroups and individuals within that population. Our integration of scientists' and social workers' knowledge into contextual Thompson Sampling resulted in students and older adults with diabetes and depression substantially increasing their step count and engaging with clinical psychology content, improving their physical and mental health [J.31, J.20, J.18].
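The subgroup modeling described above can be sketched as a contextual Thompson sampler that keeps a separate Beta-Bernoulli posterior for each (subgroup, message) pair. This is a minimal sketch assuming binary engagement rewards and discrete subgroups; the class, subgroup, and message names are hypothetical, not from the deployed system:

```python
import random
from collections import defaultdict

class SubgroupThompsonSampler:
    """Minimal contextual bandit: one Beta-Bernoulli posterior per
    (subgroup, message) pair, so each subgroup converges toward the
    messages that work best for it."""

    def __init__(self, messages):
        self.messages = messages
        # counts[(subgroup, message)] = [engagements, non-engagements]
        self.counts = defaultdict(lambda: [0, 0])

    def choose(self, subgroup):
        # Sample a plausible engagement rate per message; send the argmax.
        samples = {}
        for m in self.messages:
            s, f = self.counts[(subgroup, m)]
            samples[m] = random.betavariate(1 + s, 1 + f)
        return max(samples, key=samples.get)

    def update(self, subgroup, message, engaged):
        self.counts[(subgroup, message)][0 if engaged else 1] += 1
```

Because posteriors are conditioned on subgroup, a message that dominates overall cannot mask a subgroup for whom a different message works better – the mechanism behind detecting when a popular message harms a statistical minority.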
A core contribution of this work is identifying which algorithms and parameter settings are most reliable for personalization. We demonstrate that many intuitive personalization techniques exploit low-signal data and fail in practice; our studies characterize the empirical conditions under which different algorithms are more or less likely to successfully personalize [AC.18, AC.22]. We used Adaptive Experiments to identify when a status quo message that benefited most users inadvertently harmed a statistical minority. We increased outcomes for underrepresented subgroups by 25%, while reducing the sample size needed to detect these effects and effectively personalize.
We are also providing new techniques to personalize LLM content and coaching to a user’s specific circumstances. We collected stories about the mental health challenges faced by first-generation students, single parents, and others underrepresented in LLM training data, and used them to design a prompt-engineering framework that adapts clinician-written narratives to the exact situation a user reports. These personalized LLM messages produced greater reductions in negative thoughts and better reflection on how to apply clinical principles to each user's own context.
This work opens a broad set of questions about developing better algorithms and models for contextualization and personalization across diverse contexts, populations, and individual needs. My ongoing and future work goes deeper into mental and physical health. It also generalizes these methods to belief and behaviour change in additional high-impact domains, including mitigating political polarization, supporting sustainable behaviours to reduce climate change, and addressing misinformation on social media.
ACCELERATING SCIENTIFIC RESEARCH WITH STATISTICALLY RELIABLE ALGORITHMS FOR ADAPTIVE EXPERIMENTATION
My real-world deployments have uncovered severe limitations in using Reinforcement Learning for scientific experiments – limitations so fundamental that they arise in both the most complex reward-maximizing algorithms and the simplest bandit algorithms. I have shown that reward-maximizing algorithms can inflate false positives from ~5% to ~13%, and false negatives from ~20% to ~34%. These issues are not resolved by the algorithms commonly assumed to be statistically sound in the machine learning literature. “Proven” guarantees too often rest on formulations that became widespread by historical accident or for convenience in mathematical proofs – assumptions that break down in societally important domains, where human data are sparse, noisy, and context-dependent.
To address this gap, I have initiated a research program on Statistically Reliable Algorithms for Adaptive Experimentation. These provide the data quality that scientists need to draw rigorous and trustworthy conclusions from domains like education, health, and other societally relevant problems that concern human behaviour. We have developed algorithms that are both more interpretable and more interactive, because experimenters can encode their contextual knowledge directly into relevant parameters.
One underexplored class of these methods extends Thompson Sampling using Adaptive Epsilon-Thompson Sampling [WP.9], where epsilon is the probability of running a traditional uniform random experiment and (1 – epsilon) is the probability of using Thompson Sampling. Our novel approach, TS-Posterior Difference, lets experimenters specify what they consider a “small difference” between arms. TS-Posterior Difference then sets epsilon to the posterior probability that the difference between arms falls below that threshold. This increases traditional experimentation precisely when intervention differences are small, reducing false positives and false negatives, while still maximizing reward when differences are larger. TS-Posterior Difference outperforms the state of the art on the full tradeoff between false positives, false negatives, and reward.
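The mechanism described above can be sketched for two Beta-Bernoulli arms, with the posterior probability of a small difference estimated by Monte Carlo. This is a minimal illustration of the idea as stated here, with assumed uniform priors and function names of my own; it may differ from the algorithm in [WP.9]:

```python
import random

def posterior_small_diff(a_counts, b_counts, threshold, draws=10_000):
    """Monte Carlo estimate of the posterior probability that the two arms'
    success rates differ by less than `threshold`. Counts are (successes, failures)."""
    small = 0
    for _ in range(draws):
        pa = random.betavariate(1 + a_counts[0], 1 + a_counts[1])
        pb = random.betavariate(1 + b_counts[0], 1 + b_counts[1])
        if abs(pa - pb) < threshold:
            small += 1
    return small / draws

def ts_posterior_difference(a_counts, b_counts, threshold):
    """With probability epsilon = P(|p_A - p_B| < threshold), run a traditional
    uniform-random assignment; otherwise take a Thompson Sampling step."""
    epsilon = posterior_small_diff(a_counts, b_counts, threshold)
    if random.random() < epsilon:
        return random.choice(["A", "B"])  # uniform random experiment
    pa = random.betavariate(1 + a_counts[0], 1 + a_counts[1])
    pb = random.betavariate(1 + b_counts[0], 1 + b_counts[1])
    return "A" if pa > pb else "B"  # Thompson Sampling step
```

When the arms are hard to distinguish, epsilon is high and the system behaves like a traditional randomized experiment, protecting statistical power; when one arm is clearly better, epsilon shrinks and reward maximization takes over.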
To move beyond regret as the primary evaluation metric for bandit algorithms, we have developed an objective function that captures these tradeoffs in a way that aligns directly with the needs of practitioners and scientists in education and health [WP.10]. Our algorithm increased participant outcomes by 17%, and reduced the false negative rate by 4%. This is a substantial improvement for AI systems designed to help scientists test more ideas, more quickly, in high stakes real-world contexts.
These results point to many open questions in developing Statistically Reliable Algorithms for accelerating experimentation. This is a core challenge in the behavioural and social sciences, where reliable adaptive experimentation is needed for AI systems that aim to benefit society by supporting behaviour change across domains such as education and health. It is also a challenge in using AI in the natural sciences: we received a grant from the $200M Acceleration Consortium to evaluate and extend our algorithms in chemistry and biology.
CONCLUSION
My research investigates how adaptive experimentation can transform everyday interfaces – learning platforms, health apps, and messaging systems – into AI systems that constantly improve. By integrating Reinforcement Learning, LLMs, and insights from social-behavioural science and HCI, I design AI systems that learn to impact human beliefs and behaviour in real-world domains: education, health, misinformation, polarization, and climate awareness. These systems discover how to personalize to help everyone act in ways that benefit them and society.
We gratefully acknowledge support from our collaborators: