I propose, analyze, and apply algorithms that learn incrementally, run in real time, and converge to near-optimal solutions as the number of observations increases. Most of my recent work focuses on designing multi-armed bandit algorithms for structured real-world problems.

The exploration-exploitation trade-off is fundamental to any learning problem: the agent must balance exploration actions, which lead to learning a better model, against exploitation actions, which have the highest reward under the latest model estimate. This trade-off is often modeled as a multi-armed bandit, an online learning problem in which the actions of the learning agent are called arms. In practice, the arms can be treatments in a clinical trial or ads on a website. After pulling an arm, the agent receives its reward, and the agent's goal is to maximize its cumulative reward. Because the agent does not know the rewards of the arms in advance, it faces the so-called exploration-exploitation dilemma: explore, and learn more about the arms; or exploit, and pull the arm with the highest estimated reward so far.
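To make the dilemma concrete, here is a minimal sketch of one classical bandit strategy, epsilon-greedy, on simulated Bernoulli arms. This is an illustrative example, not any specific algorithm from my work; the arm means, horizon, and epsilon value are all hypothetical choices for the simulation.

```python
import random

def epsilon_greedy(arm_means, horizon, epsilon=0.1, seed=0):
    """Simulate an epsilon-greedy agent on Bernoulli arms.

    arm_means holds the true (unknown to the agent) reward probabilities.
    With probability epsilon the agent explores a random arm; otherwise it
    exploits the arm with the highest estimated mean reward so far.
    """
    rng = random.Random(seed)
    n_arms = len(arm_means)
    pulls = [0] * n_arms    # how many times each arm has been pulled
    est = [0.0] * n_arms    # running mean reward estimate per arm
    total = 0.0
    for _ in range(horizon):
        if rng.random() < epsilon:
            arm = rng.randrange(n_arms)                      # explore
        else:
            arm = max(range(n_arms), key=lambda a: est[a])   # exploit
        reward = 1.0 if rng.random() < arm_means[arm] else 0.0
        pulls[arm] += 1
        est[arm] += (reward - est[arm]) / pulls[arm]  # incremental mean
        total += reward
    return total, est

# Hypothetical three-armed instance: arm 2 has the highest true mean (0.7).
total, est = epsilon_greedy([0.3, 0.5, 0.7], horizon=10_000)
```

Over a long enough horizon, the agent's estimates concentrate around the true means, so exploitation increasingly pulls the best arm while the small exploration rate keeps refining the estimates of the others.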