obp.policy.offline

Offline Bandit Algorithms.

Classes

IPWLearner(n_actions, len_list, base_classifier)

Off-policy learner with Inverse Probability Weighting.

class obp.policy.offline.IPWLearner(n_actions: int, len_list: int = 1, base_classifier: Optional[sklearn.base.ClassifierMixin] = None)[source]

Bases: obp.policy.base.BaseOfflinePolicyLearner

Off-policy learner with Inverse Probability Weighting.

Parameters
  • n_actions (int) – Number of actions.

  • len_list (int, default=1) – Length of the list of actions recommended in each impression. When Open Bandit Dataset is used, len_list should be set to 3.

  • base_classifier (ClassifierMixin, default=None) – Machine learning classifier used to train an offline decision-making policy.

References

Miroslav Dudík, Dumitru Erhan, John Langford, and Lihong Li. “Doubly Robust Policy Evaluation and Optimization.”, 2014.

Damien Lefortier, Adith Swaminathan, Xiaotao Gu, Thorsten Joachims, and Maarten de Rijke. “Large-scale Validation of Counterfactual Learning Methods: A Test-Bed.”, 2016.
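
Example

A minimal construction sketch; the LogisticRegression base classifier and the five-action, single-slot setting below are illustrative assumptions for this snippet, not requirements of the API:

from sklearn.linear_model import LogisticRegression

from obp.policy.offline import IPWLearner

# IPW learner for a hypothetical 5-action, single-slot (len_list=1) setting.
# Any scikit-learn classifier (ClassifierMixin) can be plugged in as base_classifier.
ipw_learner = IPWLearner(
    n_actions=5,
    len_list=1,
    base_classifier=LogisticRegression(max_iter=1000),
)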

fit(context: numpy.ndarray, action: numpy.ndarray, reward: numpy.ndarray, pscore: Optional[numpy.ndarray] = None, position: Optional[numpy.ndarray] = None) → None[source]

Fits an offline bandit policy using the given logged bandit feedback data.

Note

This fit method trains a deterministic policy \(\pi: \mathcal{X} \rightarrow \mathcal{A}\) via a cost-sensitive classification reduction as follows:

\[\begin{split}\hat{\pi} & \in \arg \max_{\pi \in \Pi} \hat{V}_{\mathrm{IPW}} (\pi ; \mathcal{D}) \\ & = \arg \max_{\pi \in \Pi} \mathbb{E}_{\mathcal{D}} \left[\frac{\mathbb{I} \{\pi (x_{i})=a_{i} \}}{\pi_{b}(a_{i} | x_{i})} r_{i} \right] \\ & = \arg \min_{\pi \in \Pi} \mathbb{E}_{\mathcal{D}} \left[\frac{r_i}{\pi_{b}(a_{i} | x_{i})} \mathbb{I} \{\pi (x_{i}) \neq a_{i} \} \right],\end{split}\]

where \(\mathbb{E}_{\mathcal{D}} [\cdot]\) is the empirical average over observations in the logged data \(\mathcal{D}\). See the references for details.

Parameters
  • context (array-like, shape (n_rounds, dim_context)) – Context vectors in each round, i.e., \(x_t\).

  • action (array-like, shape (n_rounds,)) – Action sampled by a behavior policy in each round of the logged bandit feedback, i.e., \(a_t\).

  • reward (array-like, shape (n_rounds,)) – Observed rewards (or outcomes) in each round, i.e., \(r_t\).

  • pscore (array-like, shape (n_rounds,), default=None) – Action choice probabilities by a behavior policy (propensity scores), i.e., \(\pi_b(a_t|x_t)\).

  • position (array-like, shape (n_rounds,), default=None) – Position of each round in the given logged bandit feedback. If None is given, the learner assumes that there is only one position. When len_list > 1, position must be given.
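
Example

A minimal fitting sketch. The synthetic logged bandit feedback below (uniform behavior policy, binary rewards) is made up purely for illustration:

import numpy as np
from sklearn.linear_model import LogisticRegression

from obp.policy.offline import IPWLearner

rng = np.random.default_rng(12345)
n_rounds, n_actions, dim_context = 1000, 5, 4

# Synthetic logged bandit feedback; a uniform behavior policy implies
# a propensity score of 1 / n_actions in every round.
context = rng.normal(size=(n_rounds, dim_context))
action = rng.integers(n_actions, size=n_rounds)
reward = rng.binomial(n=1, p=0.5, size=n_rounds)
pscore = np.full(n_rounds, 1.0 / n_actions)

ipw_learner = IPWLearner(n_actions=n_actions, base_classifier=LogisticRegression(max_iter=1000))
ipw_learner.fit(context=context, action=action, reward=reward, pscore=pscore)

With len_list=1, position can be left as None.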

predict(context: numpy.ndarray) → numpy.ndarray[source]

Predict best actions for new data.

Note

The action set predicted by this predict method can contain duplicate items. If you want a non-repetitive action set, please use the sample_action method.

Parameters

context (array-like, shape (n_rounds_of_new_data, dim_context)) – Context vectors for new data.

Returns

action_dist – Action choices made by the trained classifier, which can contain duplicate items. If you want a non-repetitive action set, please use the sample_action method.

Return type

array-like, shape (n_rounds_of_new_data, n_actions, len_list)
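
Example

A minimal prediction sketch; the synthetic setup mirrors the fit example above so that the snippet runs on its own:

import numpy as np
from sklearn.linear_model import LogisticRegression

from obp.policy.offline import IPWLearner

rng = np.random.default_rng(0)
n_rounds, n_actions = 500, 3

# Fit on made-up, uniformly logged feedback (illustration only).
learner = IPWLearner(n_actions=n_actions, base_classifier=LogisticRegression(max_iter=1000))
learner.fit(
    context=rng.normal(size=(n_rounds, 4)),
    action=rng.integers(n_actions, size=n_rounds),
    reward=rng.binomial(n=1, p=0.5, size=n_rounds),
    pscore=np.full(n_rounds, 1.0 / n_actions),
)

# Deterministic action choices for 10 new contexts:
# shape (10, n_actions, len_list), one-hot along the action axis.
action_dist = learner.predict(context=rng.normal(size=(10, 4)))
print(action_dist.shape)  # (10, 3, 1)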

predict_proba(context: numpy.ndarray, tau: Union[int, float] = 1.0) → numpy.ndarray[source]

Obtains action choice probabilities for new data based on scores predicted by a classifier.

Note

This predict_proba method obtains action choice probabilities for new data \(x \in \mathcal{X}\) by first computing non-negative scores for all possible candidate actions \(a \in \mathcal{A}\) (where \(\mathcal{A}\) is the action set), and then applying the softmax function as follows:

\[P (A = a | x) = \frac{\mathrm{exp}(f(x,a) / \tau)}{\sum_{a^{\prime} \in \mathcal{A}} \mathrm{exp}(f(x,a^{\prime}) / \tau)},\]

where \(A\) is a random variable representing an action, and \(\tau\) is a temperature hyperparameter. \(f: \mathcal{X} \times \mathcal{A} \rightarrow \mathbb{R}_{+}\) is a scoring function, which is implemented in the predict_score method.

Note that this method can be used only when len_list=1; otherwise, please use the sample_action method.

Parameters
  • context (array-like, shape (n_rounds_of_new_data, dim_context)) – Context vectors for new data.

  • tau (int or float, default=1.0) – A temperature parameter, controlling the randomness of the action choice. As \(\tau \rightarrow \infty\), the algorithm will select arms uniformly at random.

Returns

choice_prob – Action choice probabilities obtained by a trained classifier.

Return type

array-like, shape (n_rounds_of_new_data, n_actions, len_list)
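
Example

A minimal sketch of predict_proba; the fitted learner, the synthetic data, and the value of tau are illustrative assumptions:

import numpy as np
from sklearn.linear_model import LogisticRegression

from obp.policy.offline import IPWLearner

rng = np.random.default_rng(0)
n_rounds, n_actions = 500, 3

learner = IPWLearner(n_actions=n_actions, base_classifier=LogisticRegression(max_iter=1000))
learner.fit(
    context=rng.normal(size=(n_rounds, 4)),
    action=rng.integers(n_actions, size=n_rounds),
    reward=rng.binomial(n=1, p=0.5, size=n_rounds),
    pscore=np.full(n_rounds, 1.0 / n_actions),
)

# Softmax over predicted scores; larger tau pushes the distribution toward
# uniform, smaller tau pushes it toward the greedy choice.
choice_prob = learner.predict_proba(context=rng.normal(size=(10, 4)), tau=0.5)
print(choice_prob.shape)        # (10, 3, 1)
print(choice_prob.sum(axis=1))  # probabilities sum to 1 over actions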

predict_score(context: numpy.ndarray) → numpy.ndarray[source]

Predict non-negative scores for all possible products of action and position.

Parameters

context (array-like, shape (n_rounds_of_new_data, dim_context)) – Context vectors for new data.

Returns

score_predicted – Scores for all possible pairs of action and position predicted by a classifier.

Return type

array-like, shape (n_rounds_of_new_data, n_actions, len_list)
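
Example

A minimal sketch of predict_score under the same illustrative synthetic setup as above:

import numpy as np
from sklearn.linear_model import LogisticRegression

from obp.policy.offline import IPWLearner

rng = np.random.default_rng(0)
n_rounds, n_actions = 500, 3

learner = IPWLearner(n_actions=n_actions, base_classifier=LogisticRegression(max_iter=1000))
learner.fit(
    context=rng.normal(size=(n_rounds, 4)),
    action=rng.integers(n_actions, size=n_rounds),
    reward=rng.binomial(n=1, p=0.5, size=n_rounds),
    pscore=np.full(n_rounds, 1.0 / n_actions),
)

# One non-negative score per (action, position) pair for each new context.
score_predicted = learner.predict_score(context=rng.normal(size=(10, 4)))
print(score_predicted.shape)  # (10, 3, 1)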

sample_action(context: numpy.ndarray, tau: Union[int, float] = 1.0, random_state: Optional[int] = None) → numpy.ndarray[source]

Sample (non-repetitive) actions based on scores predicted by a classifier.

Note

This sample_action method samples a non-repetitive set of actions for new data \(x \in \mathcal{X}\) by first computing non-negative scores for all possible candidate products of action and position \((a, k) \in \mathcal{A} \times \mathcal{K}\) (where \(\mathcal{A}\) is the action set and \(\mathcal{K}\) is the position set), and then applying the softmax function sequentially over positions as follows:

\[\begin{split}& P (A_1 = a_1 | x) = \frac{\mathrm{exp}(f(x,a_1,1) / \tau)}{\sum_{a^{\prime} \in \mathcal{A}} \mathrm{exp}( f(x,a^{\prime},1) / \tau)} , \\ & P (A_2 = a_2 | A_1 = a_1, x) = \frac{\mathrm{exp}(f(x,a_2,2) / \tau)}{\sum_{a^{\prime} \in \mathcal{A} \backslash \{a_1\}} \mathrm{exp}(f(x,a^{\prime},2) / \tau )} , \ldots\end{split}\]

where \(A_k\) is a random variable representing the action at position \(k\), and \(\tau\) is a temperature hyperparameter. \(f: \mathcal{X} \times \mathcal{A} \times \mathcal{K} \rightarrow \mathbb{R}_{+}\) is a scoring function, which is implemented in the predict_score method.

Parameters
  • context (array-like, shape (n_rounds_of_new_data, dim_context)) – Context vectors for new data.

  • tau (int or float, default=1.0) – A temperature parameter, controlling the randomness of the action choice. As \(\tau \rightarrow \infty\), the algorithm will select arms uniformly at random.

  • random_state (int, default=None) – Controls the random seed in sampling actions.

Returns

action – Action sampled by a trained classifier.

Return type

array-like, shape (n_rounds_of_new_data, n_actions, len_list)
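
Example

A minimal sketch of sample_action under the same illustrative synthetic setup; random_state fixes the sampling seed:

import numpy as np
from sklearn.linear_model import LogisticRegression

from obp.policy.offline import IPWLearner

rng = np.random.default_rng(0)
n_rounds, n_actions = 500, 3

learner = IPWLearner(n_actions=n_actions, base_classifier=LogisticRegression(max_iter=1000))
learner.fit(
    context=rng.normal(size=(n_rounds, 4)),
    action=rng.integers(n_actions, size=n_rounds),
    reward=rng.binomial(n=1, p=0.5, size=n_rounds),
    pscore=np.full(n_rounds, 1.0 / n_actions),
)

# Stochastic, non-repetitive action choices for 10 new contexts:
# one-hot along the action axis for each position.
sampled_action = learner.sample_action(context=rng.normal(size=(10, 4)), tau=1.0, random_state=12345)
print(sampled_action.shape)  # (10, 3, 1)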

property policy_type

Type of the bandit policy.