Contextual Bandits

Contextual bandit algorithms are powerful reinforcement learning techniques that enable personalization and user-focused design. They balance the exploration of new possibilities with the exploitation of the best existing options to learn and act optimally. Our lab conducts research on contextual bandits and their applications to various areas of human-computer interaction, such as encouraging people to exercise (where we use them to balance showing people new motivational messages with showing them those that have been proven to be effective in the past).
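As a minimal illustration of this exploration–exploitation trade-off (a sketch, not our deployed system), Thompson sampling keeps a Beta posterior over each message's success rate and sends the message whose sampled rate is highest. The three message response rates below are hypothetical:

```python
import random

def thompson_select(successes, failures):
    """Sample each arm's Beta posterior and pick the arm with the highest draw."""
    draws = [random.betavariate(s + 1, f + 1) for s, f in zip(successes, failures)]
    return max(range(len(draws)), key=lambda i: draws[i])

# Hypothetical response rates for three motivational messages.
true_rates = [0.10, 0.25, 0.15]
successes = [0, 0, 0]
failures = [0, 0, 0]

random.seed(0)
for _ in range(5000):
    arm = thompson_select(successes, failures)
    if random.random() < true_rates[arm]:
        successes[arm] += 1
    else:
        failures[arm] += 1

pulls = [s + f for s, f in zip(successes, failures)]
```

Over 5,000 simulated sends, the algorithm shifts most of the traffic to the most effective message (index 1 here) while still occasionally exploring the alternatives.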

Mental Health and Stress Management Interventions

Mental health is an area where personalization and user-focused design can be especially beneficial for people. With collaborators from Northwestern University, we are working on a digital intervention for Mental Health America which consists of treatment modules that involve psychoeducation material, interactive activities, and supportive messaging. Using machine learning, we will personalize content by determining which messages/prompts are most effective and engaging for users. Additionally, we are working on deploying and measuring the effects of TenQ, a condensed ten-question survey informed by cognitive behavioural therapy meant to help people reflect on their mental health, with the goal of lowering the barrier to accessing mental health resources.

Exercise Motivation

We see great potential in applying intelligent adaptive interventions to help people stick to their exercise schedules. We are working with Goodlife Fitness to apply findings from the psychology of motivation and self-control in an automated text-messaging system: using reinforcement learning, the system sends participants motivational text messages that encourage them to go to the gym and help them meet their exercise goals.

Personalized Explanations

Different students can experience different learning outcomes even when they receive the same prompt or message. For example, although a large body of academic literature shows that students learn better when they write explanations of the concepts they are learning, some students may not benefit in practice because they rush to finish their problem sets without taking time to reflect. In the personalized explanations project, we design randomized experiments to examine what educators should tell students, and in what context, to improve their learning and engagement. We then apply contextual bandits to tailor students’ experience in online education systems, embedding this approach in the University of Toronto’s PCRS online learning system.
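A common model for this kind of tailoring is a linear contextual bandit such as LinUCB. The sketch below is illustrative only: the single context feature (whether a student is rushing), the two arms, and the effect sizes are assumptions, not our PCRS deployment:

```python
import numpy as np

class LinUCB:
    """Disjoint LinUCB: one ridge-regression model per arm, plus a confidence bonus."""
    def __init__(self, n_arms, dim, alpha=1.0):
        self.alpha = alpha
        self.A = [np.eye(dim) for _ in range(n_arms)]    # X^T X + I per arm
        self.b = [np.zeros(dim) for _ in range(n_arms)]  # X^T y per arm

    def select(self, x):
        scores = []
        for A, b in zip(self.A, self.b):
            A_inv = np.linalg.inv(A)
            theta = A_inv @ b
            # point estimate plus an upper-confidence exploration bonus
            scores.append(x @ theta + self.alpha * np.sqrt(x @ A_inv @ x))
        return int(np.argmax(scores))

    def update(self, arm, x, reward):
        self.A[arm] += np.outer(x, x)
        self.b[arm] += reward * x

# Hypothetical setup: context [1, rushed]; arm 0 = short hint, arm 1 = full explanation.
rng = np.random.default_rng(0)
bandit = LinUCB(n_arms=2, dim=2)
for _ in range(2000):
    x = np.array([1.0, float(rng.integers(0, 2))])  # is the student rushing?
    arm = bandit.select(x)
    # Assumed effect: writing full explanations only helps students who are not rushing.
    p = 0.7 if (arm == 1 and x[1] == 0.0) else 0.2
    bandit.update(arm, x, float(rng.random() < p))
```

After training, comparing the per-arm reward estimates at an unrushed context shows the model has learned that the full explanation works better for students who have time to reflect.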

Student MetaSkills Interventions

We are investigating the effects of multiple interventions intended to improve first-year students’ “metaskills”: transferable skills, such as planning, growth mindset, and stress management, that can help students in many areas of their lives. These interventions have been studied separately, but it is not yet known how they interact or whether there are crossover effects. Students receive a random subset of the “metaskills” modules, while students in the control (null-subset) condition receive generic study advice instead. We plan to measure the effects of these interventions on midterm and final grades, as well as on student mindset.
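The assignment scheme can be sketched as follows; the module names and the 50% per-module inclusion probability are illustrative assumptions, not the study protocol:

```python
import random

MODULES = ["planning", "growth_mindset", "stress_management"]  # hypothetical names

def assign_condition(student_id, seed=0):
    """Independently include each metaskills module with probability 1/2.
    The empty subset is the control condition (generic study advice only)."""
    rng = random.Random(f"{seed}:{student_id}")  # reproducible per-student draw
    return [m for m in MODULES if rng.random() < 0.5]

assignments = {sid: assign_condition(sid) for sid in range(1000)}
controls = sum(1 for subset in assignments.values() if not subset)
```

With three modules there are 2^3 = 8 possible conditions, so roughly one student in eight lands in the control condition, and every main effect and interaction remains estimable.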

Statistical Inference with Multi-armed Bandit Algorithms

Multi-armed bandit (MAB) algorithms maximize expected reward, whereas randomized experiments maximize statistical power and control the Type I error rate. Randomized experiments may not be ideal when deciding which version of an online educational technology to present to students (e.g., text vs. video explanations), since students may receive inferior versions during the experiment. MAB algorithms are appealing because they maximize reward by assigning more students to the better version. However, MAB algorithms have been shown to inflate the Type I error rate and reduce power. In this project, we are developing techniques that control the Type I error rate of MAB algorithms in online educational experiments while minimizing the loss of power.
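The inflation can be seen in a small simulation (all parameters below are illustrative): two identical arms are compared with a standard pooled two-proportion z-test after data are collected either by uniform randomization or by Thompson sampling.

```python
import random

def run_experiment(adaptive, n=200, rng=None):
    """Collect binary outcomes from two identical arms (true success rate 0.5 each)."""
    s = [0, 0]; f = [0, 0]
    for _ in range(n):
        if adaptive:   # Thompson sampling chooses the arm
            arm = max((0, 1), key=lambda a: rng.betavariate(s[a] + 1, f[a] + 1))
        else:          # uniform randomization
            arm = rng.randrange(2)
        if rng.random() < 0.5:
            s[arm] += 1
        else:
            f[arm] += 1
    return s, f

def z_stat(s, f):
    """Pooled two-proportion z statistic (0.0 when it is undefined)."""
    n0, n1 = s[0] + f[0], s[1] + f[1]
    if n0 == 0 or n1 == 0:
        return 0.0
    p0, p1 = s[0] / n0, s[1] / n1
    p = (s[0] + s[1]) / (n0 + n1)
    var = p * (1 - p) * (1 / n0 + 1 / n1)
    return 0.0 if var == 0 else (p0 - p1) / var ** 0.5

rng = random.Random(0)
reject = {True: 0, False: 0}
for adaptive in (True, False):
    for _ in range(1000):
        s, f = run_experiment(adaptive, rng=rng)
        if abs(z_stat(s, f)) > 1.96:   # nominal 5% two-sided test
            reject[adaptive] += 1
```

Because both arms are identical, every rejection is a false positive; in runs of simulations like this, the adaptively collected data typically reject more often than the nominal 5%, which is exactly the inflation this project targets.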

Factorial Experiment Design

Factorial design is a systematic way for scientists to design experiments that examine several experimental variables (a.k.a. factors) at once, to see whether and how each of them affects the outcome. We frame factorial experimentation as a bandit problem and approach it with Thompson Sampling. Our goal is to evaluate how Thompson Sampling performs across different problem settings, and to provide insight into the details and nuances of such solutions.
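A minimal sketch of this framing, with hypothetical factors and assumed additive effects, treats each combination of factor levels as one bandit arm and runs Bernoulli Thompson Sampling over the combinations:

```python
import itertools
import random

# Hypothetical 2x2 factorial: two binary factors, each combination is one arm.
factors = {"message_tone": ["neutral", "encouraging"],
           "send_time": ["morning", "evening"]}
arms = list(itertools.product(*factors.values()))  # 4 factor combinations

def true_rate(arm):
    """Assumed additive effects, used only to simulate outcomes."""
    p = 0.2
    if arm[0] == "encouraging":
        p += 0.2
    if arm[1] == "evening":
        p += 0.1
    return p

random.seed(1)
s = {a: 0 for a in arms}
f = {a: 0 for a in arms}
for _ in range(8000):
    # Thompson sampling over the factor combinations.
    arm = max(arms, key=lambda a: random.betavariate(s[a] + 1, f[a] + 1))
    if random.random() < true_rate(arm):
        s[arm] += 1
    else:
        f[arm] += 1

best = max(arms, key=lambda a: s[a] + f[a])
```

In this simulation the sampling concentrates on the combination with the highest assumed rate, illustrating how a factorial experiment can be run as a bandit rather than with fixed uniform allocation.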

Contextual Bandits with Non-Stationarity in Factorial Designs

Real-world data collected over multiple time points are generally complex, often exhibiting non-stationarity that must be modeled appropriately. This is particularly true as study designs become more complicated, as in multi-factorial designs with several levels per factor (e.g., the DIAMANTE Study). When the goal of a study is to estimate the relationship, or more specifically the causal relationship, between an outcome of interest and a set of interventions, with the ultimate aim of identifying the best intervention, (contextual) multi-armed bandits are typically used.

However, while there is a broad literature on algorithms’ theoretical regret bounds, in some cases also accounting for non-stationarity, non-stationarity in real-world settings based on complex factorial designs has not yet been addressed. In this project, we investigate, through simulations grounded in real-world data, the performance of existing bandit algorithms under different types of non-stationary scenarios. The ultimate goal is to develop a new bandit algorithm that appropriately handles this problem in mHealth, where non-stationarity manifests as habituation.
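One standard way to handle such drift, shown here as an illustrative sketch rather than our proposed algorithm, is to exponentially discount past observations so the posterior can track a changing reward rate. The habituation curve below is an assumption made for the simulation:

```python
import random

def discounted_ts(reward_fn, n_rounds, gamma=0.99, n_arms=2, rng=None):
    """Bernoulli Thompson sampling whose success/failure counts decay by gamma
    each round, so old evidence fades and the posterior can track drift."""
    s = [0.0] * n_arms
    f = [0.0] * n_arms
    pulls = []
    for t in range(n_rounds):
        arm = max(range(n_arms), key=lambda a: rng.betavariate(s[a] + 1, f[a] + 1))
        reward = reward_fn(t, arm)
        s = [gamma * v for v in s]   # discount all past counts
        f = [gamma * v for v in f]
        if reward:
            s[arm] += 1
        else:
            f[arm] += 1
        pulls.append(arm)
    return pulls

# Assumed habituation: message 0 starts strong but wears off; message 1 is stable.
rng = random.Random(0)

def reward_fn(t, arm):
    p0 = max(0.1, 0.7 - 0.0004 * t)   # decays from 0.7 toward 0.1
    return rng.random() < (p0 if arm == 0 else 0.4)

pulls = discounted_ts(reward_fn, n_rounds=1500, rng=rng)
early = pulls[:300].count(0) / 300
late = pulls[-300:].count(0) / 300
```

Early on the bandit favours the initially strong message, and once habituation sets in it switches to the stable one; a standard stationary bandit can keep over-sampling the stale arm in this setting.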