Research Statement

[Last updated February 2017]

Intelligent self-improving systems driven by dynamic, personalized, collaborative experimentation

Online software has been hailed for its potential to help society, by providing scalable resources that enhance education, health, and the economy. The promise of Massive Open Online Courses was to revolutionize education and supercharge human capital in the 21st-century economy, by enabling people to learn anything from K-12 math to programming to new job skills. For public health, apps that change behavioral habits could prevent physical ailments like obesity and mental ailments like depression.

    Unfortunately, the ever-increasing numbers of educational and behavior-change apps rarely achieve their promise. Many students fail to learn, and few people succeed in changing health habits, because intuitions about how to design technology are often disproved by data. Solving the most challenging societal problems requires a deep understanding of human behavior that often does not yet exist. I aim to tackle problems that are too complex to solve without research into real-world human cognition. One of my application domains is education, where my work has impacted over 400,000 learners on platforms from Khan Academy to edX.

My research program creates intelligent self-improving systems that help people learn new concepts and change habitual behavior. These systems enhance and personalize technology, by automatically analyzing data from randomized experiments that investigate how people learn and reason. This requires using computational cognitive science and Bayesian statistics to bridge human-computer interaction with machine learning.

   An example in education is applying a theory of learning through explaining (developed in my dissertation) to create a system for crowdsourcing and automatically experimenting with explanations. The system enhanced learning from math problems as much as an expert instructor [ACM LAS 2016]. This illustrates my broader approach: My cognitive science research uses online experiments to investigate the processes underlying people's real-world learning and thinking, in tandem with 'A/B' testing ideas for improving people's experience. Data about which experimental conditions (e.g. explanations) optimize target metrics (e.g. learning and problem-solving accuracy) are modeled using Bayesian statistics. These models integrate with machine learning algorithms that dynamically choose the best conditions in order to enhance and personalize the experience of each subsequent user. To continually generate new experimental conditions for a system to test, I crowdsource contributions from users and designers. My self-improving systems are powered by combining human intelligence – in generating hypotheses that can be tested with data – with statistical machine learning – to automate rapid iteration and improvement.

1. Experimentation as an engine for self-improving systems

Randomized 'A/B' experiments are ubiquitous in psychology laboratories for developing theory, and are increasingly used in real-world online environments to test how alternative software designs impact people's behavior. To turn every online experiment into a self-improving system, I apply statistical machine learning to manage the tradeoff between 'exploration' (assigning conditions to users to explore which are best) and 'exploitation' (assigning the conditions that look best so far).

One illustrative application is the Adaptive eXplanation Improvement System (AXIS), which I built to provide explanations to students while they were solving math problems [ACM LAS 2016; bit.ly/tedxwilliams]. Explanations were chosen as a fruitful target for optimization through experimentation based on my cognitive science research [UAI 2013; Cognitive Science 2010]. Students often memorize procedures for getting answers but fail to generalize to new problems if they do not have explanations for why these procedures work. To generate actions (explanations) that could be provided to students, I designed a crowdsourcing workflow that created explanations as a by-product of students' reflective interaction with a problem (see Section 3 on human computation). A dynamic experiment investigated which explanations students found helpful for learning.

I formalized the challenge of turning experimental comparisons of explanations into data-driven improvement as a multi-armed bandit problem [e.g., Liu 2014]. The dynamic experiment in AXIS defined the set of system actions, or arms, as the alternative explanations presented as conditions, and the reward function to be optimized as students' ratings of explanations on a 1-to-10 scale. I used the Bayesian algorithm Thompson Sampling to learn a policy for choosing which actions (explanations) to present, by building a statistical model of each action's reward (a Beta-Binomial model for explanation ratings). As a randomized probability-matching algorithm, Thompson Sampling chooses each action with the probability that it has the highest reward, given the current data and model. The deployment of AXIS matched the benefits of having an expert instructor handcraft explanations, resulting in a substantial 9% increase in student learning.
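The core assignment loop is simple to sketch. Below is a minimal Python illustration of Thompson Sampling with a Beta model per arm, under the assumption that ratings are rescaled to [0, 1] and treated as fractional successes; the explanation names and update rule are illustrative, not the deployed AXIS code.

    import random

    class ThompsonSampler:
        """Thompson Sampling over a set of candidate explanations (arms)."""

        def __init__(self, arms):
            # One Beta(alpha, beta) posterior per arm, starting from a uniform prior.
            self.posteriors = {arm: {"alpha": 1.0, "beta": 1.0} for arm in arms}

        def choose(self):
            # Sample a plausible mean reward for each arm, then pick the best sample:
            # this selects each arm with the probability that it is optimal.
            samples = {arm: random.betavariate(p["alpha"], p["beta"])
                       for arm, p in self.posteriors.items()}
            return max(samples, key=samples.get)

        def update(self, arm, reward):
            # reward in [0, 1], e.g. a 1-10 rating rescaled as (rating - 1) / 9.
            # Treat it as a fractional success when updating the Beta posterior.
            self.posteriors[arm]["alpha"] += reward
            self.posteriors[arm]["beta"] += 1.0 - reward

    # Usage: serve an explanation, collect a rating, update the model.
    sampler = ThompsonSampler(["explanation_A", "explanation_B", "explanation_C"])
    arm = sampler.choose()
    rating = 8  # hypothetical student rating on a 1-10 scale
    sampler.update(arm, (rating - 1) / 9)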

2. Systems that discover how to personalize

Systems for dynamic online experiments could also enable personalization, in the sense of delivering a different condition to users with different profiles, or even to the same user at different times. With sufficient data about user characteristics, any experiment can detect 'subject-treatment' interactions, where the effect of a condition differs based on features of a user. Scientifically, discovering these interactions refines general theories to account for individual differences between people. Practically, self-improving systems could use experimentation to discover how to personalize actions to different subgroups of users. The computational challenge is that personalization requires tackling a higher-dimensional tradeoff between exploration and exploitation. Actions that are currently optimal on average might turn out to be suboptimal for a subgroup of users, as more data is collected or more diverse users arrive.

I faced this problem in a system I developed to maximize the number of people in an online course who responded to emails requesting feedback on the course. To discover how to personalize, the system dynamically adapted an experiment that compared alternative phrasings of motivational emails [EDM 2015; DynExpPers working paper]. My goal was to continue collecting data about emails that were 'suboptimal' on average but might turn out to be optimal for subgroups of users. Building on my past work using Gaussian Process regression to model human cognition [NIPS 2008], I used an approximation of Bayesian optimization that explored in proportion to the magnitude of the reward. For example, consider response rates to 3 (of the 27) versions of an email: A (11.5%), B (11.2%), C (6.1%). Once uncertainty about these rates is low, pure optimization would give nearly everyone A. Although my method assigned most people to A, the next most common assignment was B, and then C. This provided sufficient data to analyze how response rate to an email depended on contextual variables (e.g. the number of days a user was active in the course). The additional data revealed that email B produced the greatest response rates for highly active users, despite being suboptimal on average, which enabled my system to dynamically transition from randomizing to personalizing the delivery of email messages. This personalized experimentation approach increased the number of responses by 10%.
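A minimal sketch of the personalization step is shown below in Python, assuming the contextual variable is binned into subgroups and a separate Beta posterior is kept for each (email, subgroup) pair. The deployed system used an approximation of Bayesian optimization rather than this scheme, so the code illustrates the idea of subgroup-level adaptation rather than the actual method; the email names and activity threshold are hypothetical.

    from collections import defaultdict
    import random

    # One Beta posterior per (email version, user subgroup) pair, where the
    # subgroup is a binned contextual variable such as days active in the course.
    posteriors = defaultdict(lambda: {"alpha": 1.0, "beta": 1.0})
    EMAILS = ["A", "B", "C"]  # 3 of the 27 hypothetical versions

    def subgroup(days_active):
        # Hypothetical binning of the contextual variable.
        return "high_activity" if days_active >= 7 else "low_activity"

    def choose_email(days_active):
        g = subgroup(days_active)
        samples = {e: random.betavariate(posteriors[(e, g)]["alpha"],
                                         posteriors[(e, g)]["beta"])
                   for e in EMAILS}
        return max(samples, key=samples.get)

    def record_response(email, days_active, responded):
        key = (email, subgroup(days_active))
        posteriors[key]["alpha"] += 1.0 if responded else 0.0
        posteriors[key]["beta"] += 0.0 if responded else 1.0

    # With enough data, email B can dominate within the high-activity subgroup
    # even though email A has the higher response rate overall.
    email = choose_email(days_active=12)
    record_response(email, days_active=12, responded=True)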


My future work will explore the tremendous opportunities in dynamically transitioning from randomization to personalization of conditions in experiments. These issues are of interest to a wide range of researchers in cognitive science, machine learning [NIPS 2015 workshop], and applied statistics [ACIC 2016 talk]. Every online experiment could provide an opportunity to optimize user outcomes without adding any new actions or conditions: sophisticated use of data allows more effective delivery of existing actions to the people who benefit most. The Future Work section below discusses how I plan to extend this approach to health behavior change, in the spirit of personalized medicine and adaptive clinical trials [Berry 2006].

3. Leveraging users for human computation

The impact of technology can be enhanced if its users are treated not as passive recipients but as active contributors. I build systems that incorporate human computation – the scalable and systematic use of human minds to perform computation – in service of a system's goals.

3.1 Crowdsourcing improvements from users

Many crowdsourcing methods use low-skill paid workers from sites like Amazon Mechanical Turk, who are rarely equipped to generate novel actions for systems that enhance education and health. Users of a technology have additional experience and could serve as a computational resource for suggesting improvements. I applied psychological theories to design a crowdsourcing workflow that motivates users to contribute as a by-product of their interaction, and elicits user contributions in a format suitable for automatic experimentation without pre-processing. I applied this approach to get learners to generate explanations in Section 1's AXIS system. Learners were informed that writing explanations for why answers to problems were correct would help their learning [CHI 2016; Cognitive Science 2010]. Many did so and found it useful. At the same time, the explanations generated were a source of novel system actions that could help future learners. Such incentive-compatible crowdsourcing workflows harness untapped but skilled resources by aligning users' self-interest with the goals of system improvement. A diverse crowd of users is also well placed to contribute actions that allow a system to personalize experiences in ways no single designer can anticipate. My hope is to replace annoying and vague feedback surveys with interactions that engage users in the automated improvement of websites, desktop software, and apps for changing health habits like exercise.

3.2 Self-feedback

Users often have misconceptions or incorrect ideas that a system needs to revise by providing feedback, through methods like artificial intelligence [Aleven 2002] or crowdsourcing [Kulkarni 2015]. What if we could harness a user as a computational resource to revise their own incorrect beliefs [CHI 2016]? Achieving this required insights from my cognitive science dissertation, which revealed how prompting people to explain "why?" helped them generate new knowledge, even without being told whether their explanation was correct. I designed reflective question prompts that caused learners to generate self-feedback, so that they revised their incorrect beliefs about variability (a major challenge in understanding statistics). This self-feedback resulted in learners becoming 35% more accurate at solving problems [CHI 2016]. As a novel human computation method, reflective question prompts for self-feedback scale directly with the number of users. The design of prompts is also broadly applicable and easily implemented as a method for helping people learn and revise misunderstandings. Applications can be as diverse as reading websites, figuring out a new graphical interface, understanding nutrition, and formulating daily plans.

4. Foundational cognitive science

Much of the motivation for my system designs comes from my research developing psychological theories.

4.1 Bayesian Models of Human Judgments

My mathematical psychology/computational cognitive science research applied Bayesian statistics and machine learning to model cognition. For example, I built a Bayesian model that demonstrated how many of people's errors in reasoning about randomness were caused by rational use of statistically ambiguous information [JEP: LMC 2013]. I also analyzed people's judgments about what makes good causal explanations using Bayesian networks [UAI 2013] and used Gaussian Processes to model how people learn functions [NIPS 2008]. I now use related models and algorithms for dynamic experimentation and personalization.

4.2 Theories of Explanation and Learning

My dissertation built on philosophical theories and computational models of what makes for a good explanation, or answer to a "why?" question. I proposed the novel Subsumptive Constraints account of how generating explanations can help people learn, even without any external input. I showed that explaining didn't just increase engagement or attention, but selectively drove people to discover new underlying patterns, which provided a basis for generalizing to new situations [Cognitive Science 2010]. Generating explanations brought people's prior knowledge to bear in reasoning and solving problems [Cognitive Psychology 2013]. Counterintuitively, I found that the very mechanism that helped learning – discovering patterns – could sometimes harm learning, when data was sparse and prior knowledge led to misleading overgeneralizations [JEP: General 2013]. Such counterintuitive findings reveal the value of cognitive science in helping designers test their intuitions.

Future work: making self-improving systems ubiquitous

My future work will broaden the impact of intelligent self-improving systems that use experimentation to bridge research and practice, by (1) generalizing to new domains, and (2) designing tools for any researcher and practitioner to build these systems and conduct dynamic, personalized experiments.

1. Health behavior change and other applications

I aim to create self-improving systems for the broad range of technologies that can be enhanced and personalized through experiments, including online mental health, health behavior change, website testing, and marketing. In past collaborations with clinical researchers, I conducted small-scale studies to prevent depression by combining cognitive behavioral therapy (CBT) with reflective questions [Behavior Therapy and Experimental Psychiatry 2015]. My future work will use CBT to develop self-improving apps for the prevention and treatment of mental health issues, such as depression, anxiety, ADHD, and autism.

Moreover, many physical health problems (e.g. obesity, diabetes) have psychological roots in behaviors like eating and habits like medication adherence. People rely excessively on willpower to change habits, so users and designers of health behavior change apps would benefit from the effective but counterintuitive methods identified in experimental research [Shafir 2013]. Building on the success of CBT in changing physical health habits in small-scale offline settings [Hayes et al 1999], my research will use dynamic experiments in web applications and mobile apps to discover how to generalize CBT principles to changing health habits and behaviors such as eating, exercise, smoking, and medication adherence.

2. Enabling computational and behavioral science research in the real world

To democratize the creation of self-improving systems, my future work will design tools for dynamic experimentation and personalization that enable collaboration between machine learning researchers, social-behavioral scientists, and designers. I have applied for a second NSF cyberinfrastructure grant (under review), which proposes to develop and evaluate a software requirements specification for experimentation tools that are dynamic, personalized, and collaborative, providing an API for statistical machine learning and crowdsourcing.

Ecologically Valid Social-Behavioral Science and Ethical Experimentation. How can we lower the barriers for social-behavioral scientists to conduct experiments in real-world technology? Doing so would open new frontiers for asking scientific questions in real-world environments, and bring rigorous statistical methods and decades of theory to bear on helping users. Under a grant received with Neil Heffernan [NSF Cyberinfrastructure 2014-2016], we investigated how to design tools for education and psychology researchers to embed experiments within online K-12 math homework without disrupting students and teachers. Eleven researchers conducted studies with 5,000 K-12 students, leading to three publications that discovered new ways to improve learning.

   This grant informs two key directions for my future work: (1) Which tools and interaction techniques support the collaborative design of experiments by academic researchers and practical designers? (2) How can machine learning be used for ethical experimentation? I will investigate methods for dynamically modifying an experiment to balance designers' ethical goal of giving users the best conditions as soon as possible (akin to exploitation) against researchers' goal of drawing valid statistical inferences about differences between conditions (akin to exploration). My first step has been to build a system for collaborative, dynamic experimentation in online quizzes [CDE working paper]. Four Harvard faculty and I used it to design collaborative experiments that were deployed in their on-campus courses.
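One concrete way to balance these goals is to let an adaptive algorithm shift traffic toward better conditions while guaranteeing each condition a minimum assignment probability, so every condition keeps accruing data for valid inference. The Python sketch below illustrates this probability-clipping idea; it is one candidate mechanism to evaluate, not the specific method of the deployed quiz system.

    import random

    def clipped_assignment(posterior_samples, floor=0.1):
        """Blend adaptive assignment with a minimum exploration probability.

        posterior_samples: dict mapping condition -> list of sampled mean rewards
        floor: target minimum assignment probability for every condition
               (approximately preserved after renormalization), which keeps
               collecting data for statistical comparisons between conditions.
        """
        conditions = list(posterior_samples)
        n_draws = len(next(iter(posterior_samples.values())))
        # Estimate P(condition is best) by counting wins across posterior draws.
        wins = {c: 0 for c in conditions}
        for i in range(n_draws):
            best = max(conditions, key=lambda c: posterior_samples[c][i])
            wins[best] += 1
        probs = {c: wins[c] / n_draws for c in conditions}
        # Clip each probability at the floor, then renormalize.
        clipped = {c: max(p, floor) for c, p in probs.items()}
        total = sum(clipped.values())
        clipped = {c: p / total for c, p in clipped.items()}
        # Draw the assigned condition from the clipped distribution.
        return random.choices(conditions, weights=[clipped[c] for c in conditions])[0]

    # Example: posterior draws for two conditions from any Bayesian reward model.
    draws = {"version_A": [random.betavariate(20, 80) for _ in range(500)],
             "version_B": [random.betavariate(15, 85) for _ in range(500)]}
    assigned = clipped_assignment(draws, floor=0.1)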

Evaluating machine learning algorithms in real-world systems. To transform a wide variety of online experiments into engines for self-improving systems, machine learning researchers need to go beyond simulated and offline data, and conduct real-time tests of which algorithms effectively solve real-world exploration-exploitation problems. My future work will create tools that give ML researchers real-time API access to obtain data and adapt experiments. For example, I developed a web app for experimenting with and recommending lessons and problems in online courses [DynExpPers working paper; RecSys 2016 poster], which provides API access to algorithms for multi-armed bandits, reinforcement learning, and Bayesian optimization. These tools can drive machine learning beyond passive pattern discovery toward discovering the best actions to take. Algorithms typically used for discovering patterns in existing data (e.g. deep learning, random causal forests, SVMs) will be evaluated by their predictive accuracy in choosing actions for new users, and by their capacity to learn by guiding data collection.
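As an illustration of the kind of real-time access such tools could provide, the Python sketch below shows a hypothetical client that pulls an experiment's observations, refits a simple model, and pushes updated assignment probabilities back through an API. The endpoint names, payload formats, and toy model are invented for illustration and are not the actual interface of the web app described above.

    import json
    import urllib.request

    # Hypothetical endpoints; the deployed web app's actual API is not shown here.
    BASE_URL = "https://example.org/api/experiments/lesson-recommender"

    def get_json(url):
        with urllib.request.urlopen(url) as response:
            return json.loads(response.read().decode("utf-8"))

    def post_json(url, payload):
        data = json.dumps(payload).encode("utf-8")
        request = urllib.request.Request(
            url, data=data, headers={"Content-Type": "application/json"})
        with urllib.request.urlopen(request) as response:
            return json.loads(response.read().decode("utf-8"))

    def fit_policy(observations):
        """Toy model: one Beta posterior per lesson, assuming binary rewards."""
        posteriors = {}
        for obs in observations:
            a, b = posteriors.get(obs["lesson"], (1.0, 1.0))
            posteriors[obs["lesson"]] = (a + obs["reward"], b + (1.0 - obs["reward"]))
        means = {lesson: a / (a + b) for lesson, (a, b) in posteriors.items()}
        total = sum(means.values()) or 1.0
        return {"weights": {lesson: m / total for lesson, m in means.items()}}

    def sync_policy_once():
        # 1. Pull the latest observations (which lesson was shown, what outcome followed).
        observations = get_json(BASE_URL + "/observations")
        # 2. Refit whatever model is being evaluated (bandit, RL, Bayesian optimization).
        policy = fit_policy(observations)
        # 3. Push updated assignment probabilities back to the live experiment.
        return post_json(BASE_URL + "/policy", policy)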

Bridging Computational and Behavioral Science: Interpretable, Interactive Machine Learning. To support successful use by behavioral scientists and designers, machine learning needs to be interpretable and interactive. Black-box algorithms obscure designers' understanding of their users' experiences, and scientists' interpretation and statistical analysis of their data. As one example, my future work on interpretable machine learning will investigate which algorithms for data-driven experimentation are readily understood and adopted by designers and scientists. Do they prefer algorithms that are Bayesian versus frequentist, randomized versus deterministic, based on probability matching versus upper confidence bounds?

   My interactive machine learning work will enable designers and scientists to help algorithms learn. For example, I will investigate when systems for dynamic experimentation benefit from integrating human intelligence – in choosing which metrics to optimize, setting parameters governing exploration-exploitation, and encoding prior knowledge about which conditions will benefit particular user subgroups.

Conclusion. I reimagine online A/B experiments as engines for dynamic enhancement, personalization, and collaboration. This vision drives my research agenda of creating intelligent self-improving systems that perpetually enhance and personalize people's education, health, and everyday experiences with technology.

References

[ACIC 2016 talk] Williams, J. J. (2016). Adaptive experimentation in online user technologies: from randomized assignment to optimization, and from heterogeneous treatment effects to personalization. Talk presented at the 2016 Atlantic Causal Inference Conference. New York, NY.

[ACM LAS 2016] Williams, J. J., Kim, J., Rafferty, A., Maldonado, S., Gajos, K., Lasecki, W. S., & Heffernan, N. (2016). AXIS: Generating Explanations at Scale with Learnersourcing and Machine Learning. Proceedings of the Third Annual ACM Conference on Learning at Scale, 379-388. *Nominee for Best Paper

[Behavior Therapy and Experimental Psychiatry 2015] Gumport, N. B., Williams, J. J., & Harvey, A. G. (2015). Learning cognitive behavior therapy. Journal of Behavior Therapy and Experimental Psychiatry, 48, 164-169.

[CDE working paper] Williams, J. J., Rafferty, A., Gajos, K. Z., Tingley, D., Lasecki, W. S., & Kim, J. (working paper). Connecting Instructors and Learning Scientists via Collaborative Dynamic Experimentation.

[CHI 2016] Williams, J. J., Lombrozo, T., Hsu, A., Huber, B., & Kim, J. (2016). Revising Learner Misconceptions Without Feedback: Prompting for Reflection on Anomalous Facts. Proceedings of CHI (2016), 34th Annual ACM Conference on Human Factors in Computing Systems. *Honorable Mention for Best Paper (top 5%)

[Cognitive Psychology 2013] Williams, J. J., & Lombrozo, T. (2013). Explanation and prior knowledge interact to guide learning. Cognitive Psychology, 66, 55-84.

[Cognitive Science 2010] Williams, J. J., & Lombrozo, T. (2010). The role of explanation in discovery and generalization: evidence from category learning. Cognitive Science, 34, 776-806.

[DynExpPers working paper] Williams, J. J., Rafferty, A., Maldonado, S., Ang, A., Tingley, D., & Kim, J. (working paper). Designing Tools for Dynamic Experimentation and Personalization.

[EDM 2015] Whitehill, J., Williams, J. J., Lopez, G., Coleman, C., & Reich, J. (2015). Beyond Prediction: First Steps Toward Automatic Intervention in MOOC Student Stopout. In Proceedings of the 8th International Conference on Educational Data Mining. Madrid, Spain. *Nominee for Best Paper

[JEP: General 2013] Williams, J. J., Lombrozo, T., & Rehder, B. (2013). The hazards of explanation: overgeneralization in the face of exceptions. Journal of Experimental Psychology: General, 142(4), 1006-1014.

[JEP: LMC 2013] Williams, J. J., & Griffiths, T. L. (2013). Why are people bad at detecting randomness? A statistical argument. Journal of Experimental Psychology: Learning, Memory, and Cognition, 39, 1473-1490.

[NIPS 2015 workshop] Williams, J. J., Abbasi, Y., Doshi-Velez, F. (2015). Machine Learning From and For Adaptive User Technologies: From Active Learning & Experimentation to Optimization & Personalization. 29th Annual Conference on Neural Information Processing Systems (NIPS).

[NIPS 2008] Griffiths, T. L., Lucas, C. G., Williams, J. J., Kalish, M. L. (2008). Modeling human function learning with Gaussian processes. Advances in Neural Information Processing Systems 21.

[NSF Cyberinfrastructure 2014-2016] SI2-SSE. Adding Research Accounts to the ASSISTments Platform: Helping Researchers Do Randomized Controlled Studies with Thousands of Students. (1440753)  $486,000. 2014 - 2016. Co-Principal Investigator.

[RecSys 2016 poster] Williams, J. J., & Hoang, L. (2016). Combining Dynamic A/B Experimentation and Recommender Systems in MOOCs. Poster presented at the 10th Annual Conference on Recommender Systems (RecSys '16).

[UAI 2013] Pacer, M., Williams, J. J., Chen, X., Lombrozo, T., & Griffiths, T. L. (2013). Evaluating computational models of explanation using human judgments. In Nicholson, A., & Smyth, P. (Eds.), Proceedings of the Twenty-Ninth Conference on Uncertainty in Artificial Intelligence, 498-507.

[Aleven 2002] Aleven, V. A., & Koedinger, K. R. (2002). An effective metacognitive strategy: Learning by doing and explaining with a computer-based Cognitive Tutor. Cognitive Science, 26(2), 147-179.

[Berry 2006] Berry, D. A. (2006). Bayesian clinical trials. Nature Reviews Drug Discovery, 5(1), 27-36.

[Hayes et al 1999] Hayes, S. C., Strosahl, K. D., & Wilson, K. G. (1999). Acceptance and commitment therapy: An experiential approach to behavior change. Guilford Press.

[Kulkarni 2015] Kulkarni, C., Wei, K. P., Le, H., Chia, D., Papadopoulos, K., Cheng, J., & Klemmer, S. R. (2015). Peer and self assessment in massive online classes. In Design thinking research (pp. 131-168). Springer International Publishing.

[Liu 2014] Liu, Y. E., Mandel, T., Brunskill, E., & Popovic, Z. (2014). Trading off scientific knowledge and user learning with multi-armed bandits. In Educational Data Mining 2014.

[Schwartz 2004] Schwartz, D. L., & Martin, T. (2004). Inventing to prepare for future learning: The hidden efficiency of encouraging original student production in statistics instruction. Cognition and Instruction, 22(2), 129-184.

[Shafir 2013] Shafir, E. (Ed.). (2013). The behavioral foundations of public policy. Princeton University Press.