Intelligent self-improving systems driven by dynamic, personalized, collaborative experimentation
Online software has been hailed for its potential to help society, by providing scalable resources that enhance education, health, and the economy. The promise of Massive Open Online Courses was to revolutionize education and supercharge human capital in the 21st century economy, by enabling people to learn anything from K12 math to programming to new job skills. For public health, apps that change behavioral habits could prevent physical ailments like obesity and mental ailments like depression.
Unfortunately, the ever-increasing numbers of educational and behavior change apps rarely achieve their promise. Many students fail to learn, and few people succeed in changing health habits, because intuitions about how to design technology are often disproved by data. The most challenging societal problems require an understanding of human behavior that does not exist yet, and so require conducting research on human cognition in real-world contexts. To realize technology's potential to help people, I believe the practical design of online software needs to involve scientific-grade experimentation and data analysis. By taking this approach in education, my work has impacted over 400,000 learners, on platforms from Khan Academy to edX.
My research agenda creates intelligent self-improving systems that conduct dynamic experiments to discover how to optimize and personalize technology, helping people learn new concepts and change habitual behavior. This requires using computational cognitive science and Bayesian statistics to bridge human-computer interaction with machine learning.
For example, I created a system for automatically experimenting with explanations, which enhanced learning from math problems as much as an expert instructor [ACM LAS 2016]. Another system boosted people's responses to an email campaign, by dynamically discovering how to personalize motivational messages to a user's activity level [EDM 2015].
These successful applications are enabled by my integrative approach: I use my cognitive science theories in deciding the target actions for experimentation (e.g. explanations, motivational messages) and the metrics to optimize (e.g. student ratings, response rates). To generate new actions I design crowdsourcing workflows, leveraging my human-computer interaction research. Data from experiments is analyzed using methods from Bayesian statistics, and algorithms from machine learning are used to turn data into dynamic enhancement and personalization of users' experiences. My self-improving systems are powered by combining human intelligence – in generating hypotheses that can be tested with data – with statistical machine learning – to automate rapid iteration and improvement.
Large scale A/B/N experiments are often cited as an application for machine learning algorithms for reinforcement learning and Bayesian Optimization, which manage the tradeoff between exploration (assigning conditions to users to explore which are best) and exploitation (assigning the conditions that look best so far). But algorithms are largely evaluated using offline or simulated data, and it's rare for online experimentation infrastructure to enable dynamic optimization via machine learning. Bridging this gap enables the creation of self-improving systems. I have largely explored these in education, but my future work will generalize to the many domains where experimentation can enhance technology.
One illustrative application is the Adaptive eXplanation Improvement System (AXIS), which I built to provide explanations to students while they were solving math problems [ACM LAS 2016; bit.ly/tedxwilliams]. Explanations were chosen as a fruitful target for optimization through experimentation based on my cognitive science research [Uncertainty in Artificial Intelligence 2013; Cognitive Science 2010]. Students often memorize procedures for getting answers but fail to generalize to new problems if they do not have explanations for why these procedures are used. To overcome the 'cold-start' problem and generate new actions/explanations, I designed a crowdsourcing workflow that created explanations as a by-product of students' reflective interaction with a problem (see section 3. Human Computation).
I formalized the challenge of turning experimental comparisons of explanations into data-driven improvement as a multi-armed bandit problem [e.g., Liu 2014]. The dynamic experiment in AXIS included the set of system actions or arms, which were the alternative explanations presented as A or B conditions, and the reward function to be optimized, which was student rating of explanations on a 1 to 10 scale. I used the Bayesian algorithm Thompson Sampling to learn the policy for choosing which actions (explanations) to perform by building a statistical model of each action's reward (a Beta-Binomial model for explanation rating). Thompson Sampling is a randomized probability matching algorithm: each action is chosen with the probability that it has the highest reward, based on the current data and model. The deployment of AXIS matched the benefits of having an expert instructor handcraft explanations, resulting in a substantial 9% increase in student learning.
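The probability matching idea can be illustrated with a minimal sketch of Thompson Sampling over Beta posteriors. This is not the AXIS implementation: the class and function names, the uniform Beta(1, 1) prior, the simulated student ratings, and the mapping of a 1 to 10 rating onto a Bernoulli success (a standard way to apply a Beta model to bounded rewards) are all assumptions made for illustration.

```python
import random


class BetaArm:
    """One explanation (arm) with a Beta posterior over its rescaled rating."""

    def __init__(self, name, alpha=1.0, beta=1.0):
        self.name = name
        self.alpha = alpha  # prior pseudo-count of successes (assumed uniform prior)
        self.beta = beta    # prior pseudo-count of failures

    def sample(self, rng):
        # One draw from the current posterior over this arm's reward.
        return rng.betavariate(self.alpha, self.beta)

    def update(self, rating, rng, low=1, high=10):
        # Assumption: rescale the rating to [0, 1], then record a Bernoulli
        # success with that probability so the Beta posterior stays conjugate.
        reward = (rating - low) / (high - low)
        if rng.random() < reward:
            self.alpha += 1
        else:
            self.beta += 1


def choose_arm(arms, rng):
    """Thompson Sampling: draw once from each posterior, pick the largest draw.

    An arm is therefore chosen with the posterior probability that it is best.
    """
    return max(arms, key=lambda arm: arm.sample(rng))


def simulate(true_means, n_students=2000, seed=0):
    """Run the bandit against hypothetical students with noisy 1-10 ratings."""
    rng = random.Random(seed)
    arms = [BetaArm(name) for name in true_means]
    counts = {name: 0 for name in true_means}
    for _ in range(n_students):
        arm = choose_arm(arms, rng)
        counts[arm.name] += 1
        # Hypothetical rating: Gaussian noise around the arm's true mean,
        # rounded and clipped to the 1-10 scale.
        rating = min(10, max(1, round(rng.gauss(true_means[arm.name], 2.0))))
        arm.update(rating, rng)
    return arms, counts


if __name__ == "__main__":
    _, counts = simulate({"explanation A": 8.0, "explanation B": 5.0})
    print(counts)
```

In this sketch the better-rated explanation receives a growing share of students as evidence accumulates, while the other explanation is still shown occasionally for as long as the posterior leaves some chance that it is best.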
Users clamor for more personalized technology, although past work has shown that extensive research and development is often required for designers to achieve benefits beyond a one-size-fits-all approach. Self-improving systems have a tremendous opportunity to not only optimize for what is best on average, but to use experimentation to discover how to personalize actions to different subgroups of users. The computational challenge is that personalization requires tackling a higher dimensional tradeoff between exploration and exploitation. Actions that are currently optimal on average might turn out to be suboptimal for a subgroup of users, as more data is collected or more diverse users arrive.
I faced this problem in a system I developed to maximize the number of people in an online course who responded to emails requesting feedback on the course. To discover how to personalize, the system dynamically adapted an experiment that compared alternative phrasings of motivational emails [EDM 2015; ACM LAS under review]. My goal was to continue collecting data about emails that were 'suboptimal' on average but might turn out to be optimal for subgroups of users. Building on my past work using Gaussian Process regression to model human cognition [NIPS 2008], I used an approximation of Bayesian optimization that explored in proportion to the magnitude of the reward. For example, consider response rates to 3 (of the 27) versions of an email: A (11.5%), B (11.2%), C (6.1%). Once variance is minimized, pure optimization would give nearly everyone A. Although my method assigned most people to A, the next most common assignment was B, and then C. This provided sufficient data to analyze how response rate to an email depended on contextual variables (e.g. number of days a user was active in the course). This additional data revealed that email B led to the greatest response rates for highly active users, despite being suboptimal on average, which enabled my system to dynamically transition from randomizing to personalizing delivery of email messages. My personalized experimentation approach increased the number of responses by 10%.
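To make "exploring in proportion to the magnitude of the reward" concrete, here is a minimal sketch that turns the three response rates quoted above into assignment probabilities by simple normalization. This is only the most literal reading of the phrase. It is not the Gaussian Process based approximation of Bayesian optimization used in the deployed system, and the function names are hypothetical.

```python
import random


def proportional_allocation(response_rates):
    """Assignment probabilities proportional to each email's estimated reward."""
    total = sum(response_rates.values())
    return {name: rate / total for name, rate in response_rates.items()}


def assign(probabilities, rng):
    """Randomly assign one user to an email version using those probabilities."""
    names = list(probabilities)
    weights = [probabilities[name] for name in names]
    return rng.choices(names, weights=weights, k=1)[0]


# Response rates for three of the email versions, as quoted in the text.
rates = {"A": 0.115, "B": 0.112, "C": 0.061}
probs = proportional_allocation(rates)

if __name__ == "__main__":
    rng = random.Random(0)
    for name, p in probs.items():
        print(f"{name}: {p:.3f}")
    print("example assignment:", assign(probs, rng))
```

Under this rule A receives the largest share of users, followed by B and then C, so versions that look suboptimal on average keep accumulating the data needed to check whether they are best for some subgroup.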
My future work will explore the tremendous opportunities in dynamically transitioning from randomization to personalization of conditions in experiments. These issues are of interest to a wide range of researchers in cognitive science, machine learning [NIPS 2015 workshop], and applied statistics [ACIC 2016 talk]. Every online experiment could provide the opportunity to optimize user outcomes without adding new actions/conditions, but by sophisticated use of data to discover who should receive which conditions. My Future Work discusses how I plan to extend this approach into contexts for health behavior change, in the spirit of personalized medicine and adaptive clinical trials [Berry 2006].
Technology that helps people frequently requires the generation of new actions or approaches, like creating new reflective questions or motivational messages. I therefore build systems that incorporate human computation – the scalable and systematic use of human minds to perform computation in service of a system's goals.
People often have misconceptions or incorrect ideas that teachers attempt to revise by providing feedback, such as telling students about their mistaken beliefs about variability in statistics [Schwartz 2004]. Providing feedback through artificial intelligence systems [Aleven 2002] or crowdsourcing from peers [Kulkarni 2015] misses the opportunity to harness users themselves as a computational resource. I applied my cognitive science research to design reflective question prompts which guided learners to generate self-feedback. I prompted learners to explain why statistical facts were true. Although learners received no input on their explanations, they were able to revise their beliefs and construct new knowledge, leading to up to a 35% increase in solving new problems [CHI 2016]. Reflective question prompts are a broadly applicable and easily implemented method for helping people learn and revise misunderstandings, in settings as diverse as reading websites, figuring out a new graphical interface, understanding nutrition, and starting new exercise habits.
Many crowdsourcing methods use low-skill paid workers from sites like Amazon Mechanical Turk, who are less equipped to generate novel actions for systems to enhance education and health. Users of a technology can serve as a computational resource in suggesting improvements. I applied psychological theories in designing crowdsourcing workflows that motivate users to contribute as a by-product of their interaction, and elicit user contributions that are appropriate for automatic experimentation without pre-processing. I applied this approach to get learners to generate explanations in Section 1's AXIS system. Learners were asked to explain why answers to problems were correct because it benefited their own learning [CHI 2016; Cognitive Science 2010]. Many did so, and reported the value of this reflection. At the same time, the explanations that were generated served as novel actions that could be tested for their benefit to future learners. Such incentive-compatible crowdsourcing workflows harness untapped but skilled resources by aligning a user's own self-interest with the goals of system improvement. Diverse users are also well-placed to contribute actions that allow a system to personalize experiences, in ways no single designer can anticipate. My hope is to replace annoying and vague feedback surveys with interactions that engage users in automated improvement of websites, desktop software, and apps for changing health habits like exercise.
My future work will broaden the impact of intelligent self-improving systems, by (1) generalizing to new domains, and (2) designing tools that let any researcher or practitioner build these systems.
I aim to create self-improving systems for the broad range of technologies that can be enhanced and personalized through experiments, including online mental health, health behavior change, website testing, and marketing. My past collaborations with clinical researchers conducted small-scale studies to prevent depression, by combining cognitive behavioral therapy (CBT) with reflective questions [Behavior Therapy and Experimental Psychiatry 2015]. My future work will use CBT to develop self-improving apps for prevention and treatment of mental health issues, such as depression, anxiety, ADHD, and autism.
Moreover, many physical health problems (e.g. obesity, diabetes) have psychological roots, in behaviors like eating and habits like medication adherence. People rely excessively on willpower to change habits, and so users and designers of health behavior change apps would benefit from the effective but counterintuitive methods identified in experimental research [Shafir 2013]. Building on the success of CBT in changing physical health habits in small-scale offline settings [Hayes et al 1999], my research will use dynamic experiments in web applications and mobile apps to discover how to generalize CBT principles to change health habits and behaviors, such as eating, exercise, smoking, and medication adherence.
To democratize the creation of self-improving systems, my future work will design tools for collaboration between machine learning researchers, social-behavioral scientists, and designers. My first step has been to develop a software requirements specification for dynamic, personalized, and collaborative experiments that provide an API for machine learning and crowdsourcing, which is the basis for an NSF grant (under review).
Ecologically Valid Social-Behavioral Science and Ethical Experimentation. How can we lower the barriers for social-behavioral scientists to conduct experiments in real-world technology? This would open new frontiers for asking scientific questions in real-world environments, and bring rigorous statistical methods and decades of theory to bear on helping users. After receiving a grant with Neil Heffernan [NSF Cyberinfrastructure 2014-2016], we investigated how to design tools for education and psychology researchers to embed experiments within online K-12 math homework, without disruption to students and teachers. Eleven researchers conducted studies with 5000 K-12 students, leading to three publications that discovered new ways to improve learning.
This grant informs two key directions for my future work: 1. Which tools and interaction techniques support the collaborative design of experiments by academic researchers and practical designers? 2. How can machine learning be used for ethical experimentation? I will investigate methods for dynamically modifying an experiment to balance designers' ethical goal of giving users the best conditions as soon as possible (akin to exploitation), against researchers' goal of drawing valid statistical inferences about differences between conditions (akin to exploration). My first step has been to build a system for collaborative, dynamic experimentation in online quizzes [CHI under review]. Four faculty at Harvard and I used it to successfully design collaborative experiments that were deployed in their on-campus courses.
Evaluating machine learning algorithms in real-world systems. To transform a wide variety of online experiments into engines for self-improving systems, machine learning researchers need to go beyond using simulated and offline data, and do real-time tests of which algorithms effectively solve real world exploration-exploitation problems. My future work will create tools that provide ML researchers real-time API access to obtain data and adapt experiments. For example, I developed a web-app for experimenting with and recommending lessons and problems in online courses [ACM LAS under review; RecSys 2016 poster], which provides API access to algorithms for multi-armed bandits, reinforcement learning, and Bayesian optimization. These tools can drive machine learning beyond passive pattern discovery into discovering the best actions to take. Algorithms typically used for discovering patterns in existing data (e.g. deep learning, random causal forests, SVMs) will be evaluated by their predictive accuracy in choosing actions on new users, and their capacity to learn by guiding data collection.
Bridging Computational and Behavioral Science: Interpretable, Interactive Machine Learning. To support successful use by behavioral scientists and designers, machine learning needs to be interpretable and interactive. Black-box algorithms obscure designers' understanding of their users' experiences, and scientists' interpretation and statistical analysis of their data. Consider one example of how my future work on interpretable machine learning can address this. I will investigate which algorithms for data-driven experimentation are readily understood and adopted by designers and scientists. Are there preferences for algorithms that are Bayesian versus Frequentist, randomized versus deterministic, based on probability matching versus upper-confidence bounds?
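The contrast between deterministic upper-confidence-bound rules and randomized probability matching can be seen in a small sketch. It compares the standard UCB1 index with a Thompson Sampling draw over Beta posteriors on the same data. The counts below are hypothetical, and neither function is taken from the systems described in this statement.

```python
import math
import random


def ucb1_choice(successes, pulls):
    """Deterministic: pick the arm with the highest UCB1 index."""
    total = sum(pulls.values())

    def index(arm):
        if pulls[arm] == 0:
            return float("inf")  # untried arms are always tried first
        mean = successes[arm] / pulls[arm]
        bonus = math.sqrt(2 * math.log(total) / pulls[arm])
        return mean + bonus

    return max(pulls, key=index)


def thompson_choice(successes, pulls, rng):
    """Randomized: draw from each Beta(1 + s, 1 + f) posterior, pick the max."""
    def draw(arm):
        failures = pulls[arm] - successes[arm]
        return rng.betavariate(1 + successes[arm], 1 + failures)

    return max(pulls, key=draw)


# Hypothetical data: condition A shown 100 times, condition B shown 50 times.
successes = {"A": 30, "B": 20}
pulls = {"A": 100, "B": 50}

if __name__ == "__main__":
    rng = random.Random(0)
    print("UCB1 always picks:", ucb1_choice(successes, pulls))
    picks = [thompson_choice(successes, pulls, rng) for _ in range(1000)]
    print("Thompson picks A in", picks.count("A"), "of 1000 draws")
```

Given the same data, the UCB1 rule returns the same condition every time, whereas the Thompson rule spreads users across conditions according to the current uncertainty. That difference in behavior is the kind of property a designer or scientist may find more or less intuitive.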
My interactive machine learning work will enable designers and scientists to help algorithms learn. For example, I will investigate when systems for dynamic experimentation benefit from integrating human intelligence – in choosing which metrics to optimize, setting parameters governing exploration-exploitation, and encoding prior knowledge about which conditions will benefit particular user subgroups.
Conclusion. I reimagine online A/B experiments as engines for dynamic enhancement, personalization, and collaboration. This will enable my research agenda to create intelligent self-improving systems that perpetually enhance and personalize people's education, health, and everyday experiences with technology.
[ACIC 2016 talk] Williams, J. J. (2016). Adaptive experimentation in online user technologies: from randomized assignment to optimization, and from heterogeneous treatment effects to personalization. Talk presented at the 2016 Atlantic Causal Inference Conference. New York, NY.
[ACM LAS 2016] Williams, J. J., Kim, J., Rafferty, A., Maldonado, S., Gajos, K., Lasecki, W. S., & Heffernan, N. (2016). AXIS: Generating Explanations at Scale with Learnersourcing and Machine Learning. Proceedings of the Third Annual ACM Conference on Learning at Scale, 379-388. *Nominee for Best Paper
[ACM LAS under review] Williams, J. J., Rafferty, A., Maldonado, S., Ang, A., Tingley, D., & Kim, J. (under review). Designing Tools for Dynamic Experimentation and Personalization. Submitted to the Fourth Annual ACM Conference on Learning at Scale.
[Behavior Therapy and Experimental Psychiatry 2015] Gumport, N. B., Williams, J. J., & Harvey, A. G. (2015). Learning cognitive behavior therapy. Journal of Behavior Therapy and Experimental Psychiatry, 48, 164-169.
[CHI 2016] Williams, J. J., Lombrozo, T., Hsu, A., Huber, B., & Kim, J. (2016). Revising Learner Misconceptions Without Feedback: Prompting for Reflection on Anomalous Facts. Proceedings of CHI (2016), 34th Annual ACM Conference on Human Factors in Computing Systems. *Honorable Mention for Best Paper (top 5%)
[CHI under review] Williams, J. J., Rafferty, A., Gajos, K. Z., Tingley, D., Lasecki, W. S., & Kim, J. (under review). Connecting Instructors and Learning Scientists via Collaborative Dynamic Experimentation. Submitted to CHI 2017, 35th Annual ACM Conference on Human Factors in Computing Systems.
[Cognitive Science 2010] Williams, J. J., & Lombrozo, T. (2010). The role of explanation in discovery and generalization: evidence from category learning. Cognitive Science, 34, 776-806.
[EDM 2015] Whitehill, J., Williams, J. J., Lopez, G., Coleman, C., & Reich, J. (2015). Beyond Prediction: First Steps Toward Automatic Intervention in MOOC Student Stopout. In Proceedings of the 8th International Conference on Educational Data Mining. Madrid, Spain. *Nominee for Best Paper
[NIPS 2015 workshop] Williams, J. J., Abbasi, Y., & Doshi-Velez, F. (2015). Machine Learning From and For Adaptive User Technologies: From Active Learning & Experimentation to Optimization & Personalization. 29th Annual Conference on Neural Information Processing Systems (NIPS).
[NIPS 2008] Griffiths, T. L., Lucas, C. G., Williams, J. J., & Kalish, M. L. (2008). Modeling human function learning with Gaussian processes. Advances in Neural Information Processing Systems 21.
[NSF Cyberinfrastructure 2014-2016] SI2-SSE. Adding Research Accounts to the ASSISTments Platform: Helping Researchers Do Randomized Controlled Studies with Thousands of Students. (1440753) $486,000. 2014 - 2016. Co-Principal Investigator.
[RecSys 2016 poster] Williams, J. J., & Hoang, L. (2016). Combining Dynamic A/B Experimentation and Recommender Systems in MOOCs. Poster presented to the 10th Annual Conference on Recommender Systems (RecSys '16).
[Aleven 2002] Aleven, V. A., & Koedinger, K. R. (2002). An effective metacognitive strategy: Learning by doing and explaining with a computer-based Cognitive Tutor. Cognitive Science, 26(2), 147-179.
[Berry 2006] Berry, D. A. (2006). Bayesian clinical trials. Nature Reviews Drug Discovery, 5(1), 27-36.
[Hayes et al 1999] Hayes, S. C., Strosahl, K. D., & Wilson, K. G. (1999). Acceptance and commitment therapy: An experiential approach to behavior change. Guilford Press.
[Kulkarni 2015] Kulkarni, C., Wei, K. P., Le, H., Chia, D., Papadopoulos, K., Cheng, J., & Klemmer, S. R. (2015). Peer and self assessment in massive online classes. In Design thinking research (pp. 131-168). Springer International Publishing.
[Liu 2014] Liu, Y. E., Mandel, T., Brunskill, E., & Popovic, Z. (2014, July). Trading off scientific knowledge and user learning with multi-armed bandits. In Educational Data Mining 2014.
[Schwartz 2004] Schwartz, D. L., & Martin, T. (2004). Inventing to prepare for future learning: The hidden efficiency of encouraging original student production in statistics instruction. Cognition and Instruction, 22(2), 129-184.
[Shafir 2013] Shafir, E. (Ed.). (2013). The behavioral foundations of public policy. Princeton University Press.