Overview

Setup

We consider a general contextual bandit setting. Let \(r \in [0, R_{\mathrm{max}}]\) denote a reward or outcome variable (e.g., whether recommending a fashion item as an action results in a click). We let \(x \in \calX\) be a context vector (e.g., the user’s demographic profile) that the decision maker observes when picking an action. Rewards and contexts are sampled from the unknown probability distributions \(p (r \mid x, a)\) and \(p(x)\), respectively. Let \(\calA:=\{0,\ldots,m\}\) be a finite set of \(m+1\) actions. We call a function \(\pi: \calX \rightarrow \Delta(\calA)\) a policy. It maps each context \(x \in \calX\) into a distribution over actions, where \(\pi (a \mid x)\) is the probability of taking action \(a\) given \(x\).
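As a concrete illustration, the following sketch implements a stochastic policy \(\pi: \calX \rightarrow \Delta(\calA)\) over a small finite action set using a softmax over a linear score. The Gaussian context, the linear parameterization, and all variable names are illustrative assumptions, not part of the setup above.

```python
# A minimal sketch of a stochastic policy pi: X -> Delta(A) over m + 1 actions.
# The softmax-over-linear-score parameterization is an illustrative assumption.
import numpy as np

rng = np.random.default_rng(0)
dim_context, n_actions = 5, 4                        # |A| = m + 1 = 4 actions
theta = rng.normal(size=(n_actions, dim_context))    # illustrative policy parameters

def pi(x: np.ndarray) -> np.ndarray:
    """Return the probability vector (pi(a | x)) for all a in A."""
    scores = theta @ x
    probs = np.exp(scores - scores.max())             # numerically stable softmax
    return probs / probs.sum()

x = rng.normal(size=dim_context)                      # a context x ~ p(x) (illustrative)
action_dist = pi(x)                                   # a distribution over the m + 1 actions
```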

Let \(\calD := \{(x_t,a_t,r_t)\}_{t=1}^{T} \) denote \(T\) rounds of observations, where \(a_t\) is a discrete variable indicating which action in \(\calA\) is chosen in round \(t\), and \(r_t\) and \(x_t\) denote the reward and the context observed in round \(t\), respectively. We assume that the logged bandit feedback is generated by a behavior policy \(\pi_b\) as follows:

\[\{(x_t,a_t,r_t)\}_{t=1}^{T} \sim \prod_{t=1}^{T} p(x_t) \pi_b (a_t \mid x_t) p(r_t \mid x_t, a_t),\]

where each context-action-reward triplet is sampled independently from the product distribution. Note that we assume \(a_t\) is independent of \(r_t\) conditional on \(x_t\).
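The sketch below simulates this data-generating process: in each round, a context is drawn from \(p(x)\), an action from \(\pi_b(\cdot \mid x_t)\), and a reward from \(p(r \mid x_t, a_t)\). The Gaussian contexts, the uniform behavior policy, and the logistic Bernoulli reward model are illustrative assumptions.

```python
# A minimal sketch of logged bandit feedback generation:
# x_t ~ p(x), a_t ~ pi_b(. | x_t), r_t ~ p(r | x_t, a_t).
# Context, behavior policy, and reward models are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(1)
T, dim_context, n_actions = 1000, 5, 4
weights = rng.normal(size=(n_actions, dim_context))   # used to define q(x, a) below

def q(x: np.ndarray, a: int) -> float:
    """Mean reward q(x, a) = E[r | x, a] (illustrative logistic model)."""
    return 1.0 / (1.0 + np.exp(-weights[a] @ x))

def pi_b(x: np.ndarray) -> np.ndarray:
    """Behavior policy pi_b(. | x); here simply uniform over actions."""
    return np.full(n_actions, 1.0 / n_actions)

logged_data = []
for _ in range(T):
    x = rng.normal(size=dim_context)                   # x_t ~ p(x)
    a = rng.choice(n_actions, p=pi_b(x))               # a_t ~ pi_b(. | x_t)
    r = rng.binomial(1, q(x, a))                       # r_t ~ p(r | x_t, a_t)
    logged_data.append((x, a, r))
```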

We let \(\pi(x,a,r) := p(x) \pi (a \mid x) p(r \mid x, a)\) denote the product distribution induced by a policy \(\pi\). For a function \(f(x,a,r)\), we use \(\E_{\calD} [f] := |\calD|^{-1} \sum_{(x_t, a_t, r_t) \in \calD} f(x_t, a_t, r_t)\) to denote its empirical expectation over the \(T\) observations in \(\calD\). For a function \(g(x,a)\), we let \(g(x,\pi) := \E_{a \sim \pi(a \mid x)}[g(x,a) \mid x]\). We also use \(q(x,a) := \E_{r \sim p(r \mid x,a)} [ r \mid x, a ]\) to denote the mean reward function.
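The empirical expectation \(\E_{\calD}[f]\) is straightforward to compute from the logged triplets. A minimal sketch, assuming `logged_data` is a list of \((x_t, a_t, r_t)\) tuples as in the simulation above:

```python
# A minimal sketch of the empirical expectation E_D[f] over logged triplets.
# `logged_data` is assumed to be a list of (x_t, a_t, r_t) tuples.
import numpy as np

def empirical_expectation(logged_data, f):
    """E_D[f] := |D|^{-1} * sum of f(x_t, a_t, r_t) over the triplets in D."""
    return float(np.mean([f(x, a, r) for (x, a, r) in logged_data]))

# Example: the empirical mean reward observed under the behavior policy, E_D[r].
# mean_logged_reward = empirical_expectation(logged_data, lambda x, a, r: r)
```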

Estimation Target

We are interested in using the historical logged bandit data to estimate the following policy value of any given evaluation policy \(\pi_e\), which may differ from \(\pi_b\):

\[V (\pi_e) := \E_{(x,a,r) \sim \pi_e (x,a,r)} [r] = \E_{x \sim p(x)} [q(x,\pi_e)],\]

where the second equality follows from the factorization \(\pi_e(x,a,r) = p(x) \pi_e(a \mid x) p(r \mid x, a)\) and the definition of the mean reward function \(q\). We allow the evaluation policy \(\pi_e\) to be degenerate, i.e., it may choose a particular action with probability 1. Estimating \(V(\pi_e)\) before implementing \(\pi_e\) in an online environment is valuable because \(\pi_e\) may perform poorly and damage user satisfaction. Moreover, by comparing the estimated values of several candidate evaluation policies, we can select the one with the highest estimated policy value without incurring additional implementation cost.
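As a sanity check on the definition, \(V(\pi_e) = \E_{x \sim p(x)}[q(x,\pi_e)]\) can be approximated by Monte Carlo when the mean reward function \(q\) and the context distribution are known, as in a synthetic simulation; in practice \(q\) is unknown, which is why \(V(\pi_e)\) must be estimated from \(\calD\). The sketch below is illustrative, and all function names and arguments are assumptions.

```python
# A minimal sketch approximating V(pi_e) = E_x[ sum_a pi_e(a | x) q(x, a) ]
# by Monte Carlo over fresh contexts, assuming access to the (in practice
# unknown) mean reward function q; all names here are illustrative.
import numpy as np

def approximate_policy_value(pi_e, q, sample_context, n_actions,
                             n_samples=10_000, seed=2):
    rng = np.random.default_rng(seed)
    values = []
    for _ in range(n_samples):
        x = sample_context(rng)                         # x ~ p(x)
        probs = pi_e(x)                                 # pi_e(. | x)
        values.append(sum(probs[a] * q(x, a)            # q(x, pi_e)
                          for a in range(n_actions)))
    return float(np.mean(values))                       # Monte Carlo estimate of V(pi_e)
```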