Advances in multi-arm bandits and sequential decision making

30 March 2009, 9:00-17:00
The Institute for Mathematical Sciences
Imperial College London
London, UK

Sequential decision making is becoming increasingly important in a variety of applications. This application driven need is being met by new theoretical and computational developments, including the extension of existing frameworks to more realistic contexts. The purpose of this workshop is to explore some of these developments. We are pleased to announce that Professor Nicolo Cesa-Bianchi will provide a keynote address. The meeting will feature talks from researchers from the ALADDIN project and others.


All the presentations are available in this zip file


Nicolo Cesa-Bianchi, Universita degli Studi di Milano
"Bandit Algorithms for Online Linear Optimization"

Remi Munos, INRIA
"Online Optimistic Optimization"

David Leslie, Bristol University
"Posterior weighted reinforcement learning with state uncertainty"

Steve Roberts, University of Oxford
"Sequential Decision making in uncertain environments"

Maike Kaufman , University of Oxford
"Multi-agent decision making under sparse communication"

Mike Osborne , University of Oxford
"Gaussian Process Global Optimisation"

Mark Ebden , University of Oxford
"Decentralized dynamic coalition formation"

Erol Gelenbe,Imperial College London
"Learning based adaptive routing in networks"

Nicos Pavlidis, Imperial College London
"Multi-armed bandits with covariates"

Adam Sykulski , Imperial College London
"Optimal learning using e-greedy algorithms in one-armed bandit problems"

Dave Nicholson, BAe Systems
"Bandit applications in the battlespace"

Georgios Chalkiadakis, University of Southampton
"Decentralized Bayesian Reinforcement Learning for Online Agent Collaboration"

Archie Chapman, University of Southampton
"Learning Nash Equilibria in Potential Games with Perturbed Rewards"

Enrique Munoz de Cote, University of Southampton
"Social Reward Shaping"


The draft programme is as follows:
9:00-9:25 Registration
9:25-9:30 Opening
9:30-10:30 Nicolo Cesa-Bianchi, "Bandit Algorithms for Online Linear Optimization"
10:30-11:00 Coffee
11:00-11:30 Adam Sykulski"Optimal learning using e-greedy algorithms in one-armed bandit problems"
11:30-12:00 David Leslie, "Posterior weighted reinforcement learning with state uncertainty"
12:00-12:15 Georgios Chalkiadakis, "Decentralized Bayesian Reinforcement Learning for Online Agent Collaboration"
12:15-12:30 Archie Chapman, "Learning Nash Equilibria in Stochastic Potential Games"
12:30-12:45Mark Ebden "Decentralized dynamic coalition formation"
12:45-13:00Mike Osborne "Gaussian Process Global Optimisation"
13:00-14:00 Lunch
14:00-14:30 Erol Gelenbe, "Learning based adaptive routing in networks"
14:30-14:45 Steve Roberts"Sequential Decision making in uncertain environments"
14:45-15:00Maike Kaufman "Multi-agent decision making under sparse communication"
15:00-15:15 Nicos Pavlidis, "Multi-armed bandits with covariates"
15:15-15:30 Enrique Munoz de Cote,"Social Reward Shaping"
15:30-15:45 Dave Nicholson,"Bandit applications in the battlespace"
15:45-16:00 Coffee
16:00-17:00 Remi Munos, "Online Optimistic Optimization"


The meeting has no registration fee, thanks to generous funding from the ALADDIN project. The meeting is open to all. Pre-registration is required. You can register by email:
Information about getting to Imperial is available at

Imperial college conference link ( ) provides some possibilities should you need overnight accommodation.


Posterior weighted reinforcement learning with state uncertainty

David Leslie

Reinforcement learning models are, in essence, online algorithms to estimate the expected reward in each of a set of states by allocating observed rewards to states and calculating averages. Generally it is assumed that a learner can unambiguously identify the state of nature. However in any natural environment the state information is noisy, so that the learner cannot be certain about the current state of nature. Under state uncertainty it is no longer immediately obvious how to perform reinforcement learning, since the observed reward cannot be unambiguously allocated to a particular state of the environment. A new technique, posterior weighted reinforcement learning, is introduced. In this process the reinforcement learning updates are weighted according to the posterior state probabilities, calculated after observation of the reward. We show that this modified algorithm can converge to correct reward estimates, and show the procedure to be a variant of an online expectation-maximisation algorithm, allowing further analysis to be carried out.

Bandit Algorithms for Online Linear Optimization

Nicolo Cesa-Bianchi

In the online linear optimization problem a forecaster chooses, at each time instance, a vector x from a certain given subset S of the D-dimensional Euclidean space and suffers a time-dependent loss that is linear in x. The goal of the forecaster is to achieve that, in the long run, the accumulated loss is not much larger than that of the best possible vector in S. In this talk we consider the "bandit" setting of the linear optimization problem, in which the forecaster has only access to the losses of the chosen vectors. We survey some recent algorithms that solve this problem. For the special case in which S is a subset of the d-dimensional Boolean hypercube, we describe a new forecaster whose performance, for a variety of concrete choices of S, is better than all previously known bounds, and not improvable in general. We also point out that computationally efficient implementations for various interesting choices of S exist.
Joint work with Gabor Lugosi (Barcelona).

Online Optimistic Optimization

Remi Munos

The "optimism in the face of uncertainty" principle has been recently successfully applied to complex numerical optimization and sequential decision making problems, e.g., the game of Go. The idea is to use regret-minimization algorithms (the so-called bandits) to explore efficiently the set of possible strategies given noisy evaluations of their performance. In this talk I will discuss the Upper Confidence Bounds (UCB) approach when the number of possible strategies is large (and possibly infinite), and present the idea of Hierarchical Optimistic Optimization (HOO) algorithms.
Some related references are: [Coquelin and Munos, UAI 2007], [Bubeck, Munos, Stoltz, Szepesvari, NIPS 2008], [Hren and Munos, EWRL 2008]

Decentralized Bayesian Reinforcement Learning for Online Agent Collaboration

Georgios Chalkiadakis

Solving complex but structured problems in a decentralised manner via multiagent collaboration is a task that has received much attention in recent years. The problem becomes particularly challenging in uncertain domains, where the agents face the additional need to learn the environment parameters. This is exactly the case in sensor network environments; however, related research has so far ignored the existence of online collaborative learning techniques and revert to offline methods to update sensor coordination schedules given knowledge obtained through a common calibration phase. At the same time, existing collaborative online learning techniques are point-based, thus providing a suboptimal solution to the exploration-exploitation problem of the agents. Against this background, we integrate a generalization of the well-known message-passing max-sum agent coordination algorithm with tractable Bayesian reinforcement learning methods we propose, in order to tackle the agents sequential decision making problem. In a nutshell, our approach is to compute the global multi-agent MDP describing the underlying coordination problem via the evaluation of its component multi-agent MDPs. As a result, we effectively achieve to efficiently learn an approximate solution to the global MDP in a decentralized manner.
This is joint work with W.T.Luke Teacy, Alessandro Farinelli, Alex Rogers and N.R. Jennings

Learning Nash Equilibria in Potential Games with Perturbed Rewards

Archie Chapman

Potential games are a useful model of multi-agent systems, encompassing problems such as distributed constraint optimisation, task and resource allocation, and congestion models. However, little work has focused on the case where agents need to explore the joint action space to estimate their rewards for different action profiles. As such, this talk will cover some recent work extending convergence proofs for fictitious play-like algorithms to problems of sequential decision making in potential games with unknown perturbed payoffs when Q-learning is used to estimate the reward functions.

Learning based adaptive behaviour in networks

Erol Gelenbe

Adaptive packet routing requires that a series of sequential decisions be made, each one of them among a finite set of alternatives (choice of the next node to be visited), for each packet that is forwarded from one node to another, towards the destination of the packet. In a very large network, the choice of the next node to be visited by the packet is far from obvious. When routing is carried out as a function of quality of service (QoS) the cost incurred at each step can be, for instance, the random delay required to traverse a node, and in addition packet loss may also occur forcing a duplicate copy of the packet to be retransmitted some time later (after a "time-out") from the original source of the packet. Our presentation will briefly outline two points: (1) how can standard multi-arm bandit models be applied to QoS routing, and how can classical results such as Gittins index based optimality be used in this problem, and (2) how can the multi-arm bandit problem be generalised to address the specific structure of QoS routing.

Bandit applications in the battlespace

Dave Nicholson

Sensor management is considered as an application focus for multi-armed bandits in the battlespace. While modern agile sensors offer many control options these are usually constrained by the battlespace environment: lots of targets to monitor, limited fields of view, limited time on targets due to deadlines or exposure to threat. Other factors include choice of performance objective (search, detect, track, classify), time horizon (finite or infinite discounted), uncertainy models, and moving or stationary targets. Sensor management problems are typically solved with sub-optimal myopic algorithms, sacrificing performance for efficiency. We present a sample of battlespace sensor management problems and discuss under what conditions a multi-arm bandits approach could provide efficient optimal or near-optimal solutions.

Social Reward Shaping

Enrique Munoz de Cote

Imagine two learning agents (A and B) that repeatedly encounter each other on some specific situation (say on an iPD game). It is well known that pairing two learning agents, both using a table lookup TD paradigm (e.g. Q-learning) are not guaranteed to learn a specific pair of strategies. But what happens if (say) agent A acts as a 'learning leader' trying to guide the learning process of agent B? I will present experimental evidence on how this situation arises and some future directions on how to efficiently guide the coupled learning process.

Multi-armed bandits with covariates

Nicos Pavlidis

We study the sequential action selection problem in information rich and dynamic environments through the multi--armed bandit with covariates framework. In this generalisation of the multi--armed bandit, the expected rewards are functions of the covariate vector. The agent learns to associate the covariate with the optimal action, essentially learning to partition the covariate space. This formulation provides a more realistic representation of most real sequential decision making problems. We also consider dynamic problems in which the agent is unaware of the nature of the dynamics and tracks the evolving environment only via the covariate and the rewards. This creates difficulties not encountered in static problems, and changes the exploitation--exploration dilemma.

Optimal learning using e-greedy algorithms in one-armed bandit problems

Adam Sykulski

A one-armed bandit problem with covariates is introduced in which the agent must choose whether or not to pull the arm based on an observed covariate. For cases where the conditional expected reward is negative the agent would prefer not to pull the arm. The agent must learn the coefficients of the reward function, and thus faces the exploration-exploitation dilemma. For a 1-dimensional covariate, we prove that the optimal learning parameter using an e-greedy algorithm is actually e=0, meaning that greedy always outperforms e-greedy, regardless of the amount of noise in the reward function. We discuss how this result generalises in n-dimensions. We also discuss using e-first algorithms and demonstrate how non-zero values of e can be optimal.

Sequential Decision making in uncertain environments

Steve Roberts

We consider sequential decision making as part of iterative Bayesian modelling. This talk will present the underlying theory, extend it to practical and computationally efficient algorithms and present its efficacy on both test and real data. The methods developed ar robust to missing observations and labels and are capable of making n-step ahead decision forecasts.

Multi-agent decision making under sparse communication"

Maike Kaufman

Most approaches to solving and designing Multi-agent systems assume either full (and free) communication between agents or none at all. We will illustrate an approach to dealing with sparse inter-agent communication, based on Bayesian Prediction and discuss how this can be tied in with a more general treatment of Multi-agent systems with variable levels of communication.

Gaussian Process Global Optimisation

Mike Osborne

We introduce a novel Bayesian approach to global optimization using Gaussian processes. We frame the optimization of both noisy and noiseless functions as sequential decision problems, and introduce myopic and non-myopic solutions to them. Here our solutions can be tailored to exactly the degree of confidence we require of them. The use of Gaussian processes allows us to benefit from the incorporation of prior knowledge about our objective function, and also from any derivative observations. Using this latter fact, we introduce an innovative method to combat conditioning problems. Our algorithm demonstrates a significant improvement over its competitors in overall performance across a wide range of canonical test problems. Our current work considers sequential decision making and strategy formation as a sequential optimisation process.

Decentralized dynamic coalition formation

Mark Ebden

The problem of assigning multiple sensors to multiple moving objects has known, efficient solutions if the sensor networks are small and if the information about environmental dynamics is only partially utilized. We present a method of decentralized dynamic coalition formation in such networks when these two conditions are relaxed. First, the method's compatability with belief propagation protocol eliminates limits on the network size. Second, the method's approach to sequential decision making avoids combinatorial explosion over a time horizon, allowing environmental forecasts with arbitrarily high or low confidence to be incorporated when choosing sensor states. We suggest the method is appropriate for certain problems in job scheduling and operations research.