obp.policy.offline
Offline Bandit Algorithms.
Classes

IPWLearner — Off-policy learner with Inverse Probability Weighting.
class obp.policy.offline.IPWLearner(n_actions: int, len_list: int = 1, base_classifier: Optional[sklearn.base.ClassifierMixin] = None)

Bases: obp.policy.base.BaseOfflinePolicyLearner
Off-policy learner with Inverse Probability Weighting.
- Parameters
n_actions (int) – Number of actions.
len_list (int, default=1) – Length of the list of actions recommended in each impression. When the Open Bandit Dataset is used, this should be set to 3.
base_classifier (ClassifierMixin, default=None) – Machine learning classifier used to train an offline decision-making policy.
References
Miroslav Dudík, Dumitru Erhan, John Langford, and Lihong Li. “Doubly Robust Policy Evaluation and Optimization”, 2014.
Damien Lefortier, Adith Swaminathan, Xiaotao Gu, Thorsten Joachims, and Maarten de Rijke. “Large-scale Validation of Counterfactual Learning Methods: A Test-Bed”, 2016.
fit(context: numpy.ndarray, action: numpy.ndarray, reward: numpy.ndarray, pscore: Optional[numpy.ndarray] = None, position: Optional[numpy.ndarray] = None) → None

Fits an offline bandit policy on the given logged bandit feedback data.
Note
This fit method trains a deterministic policy \(\pi: \mathcal{X} \rightarrow \mathcal{A}\) via a cost-sensitive classification reduction as follows:
\[\begin{split}\hat{\pi} & \in \arg \max_{\pi \in \Pi} \hat{V}_{\mathrm{IPW}} (\pi ; \mathcal{D}) \\ & = \arg \max_{\pi \in \Pi} \mathbb{E}_{\mathcal{D}} \left[\frac{\mathbb{I} \{\pi (x_{i})=a_{i} \}}{\pi_{b}(a_{i} | x_{i})} r_{i} \right] \\ & = \arg \min_{\pi \in \Pi} \mathbb{E}_{\mathcal{D}} \left[\frac{r_i}{\pi_{b}(a_{i} | x_{i})} \mathbb{I} \{\pi (x_{i}) \neq a_{i} \} \right],\end{split}\]where \(\mathbb{E}_{\mathcal{D}} [\cdot]\) is the empirical average over observations in \(\mathcal{D}\). See the reference for the details.
- Parameters
context (array-like, shape (n_rounds, dim_context)) – Context vectors in each round, i.e., \(x_t\).
action (array-like, shape (n_rounds,)) – Actions sampled by a behavior policy in each round of the logged bandit feedback, i.e., \(a_t\).
reward (array-like, shape (n_rounds,)) – Observed rewards (or outcomes) in each round, i.e., \(r_t\).
pscore (array-like, shape (n_rounds,), default=None) – Action choice probabilities of a behavior policy (propensity scores), i.e., \(\pi_b(a_t|x_t)\).
position (array-like, shape (n_rounds,), default=None) – Position of each round in the given logged bandit feedback. If None is given, the learner assumes that there is only one position. When len_list > 1, position has to be set.
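As a minimal numpy sketch of the objective above, the empirical IPW value \(\hat{V}_{\mathrm{IPW}}(\pi; \mathcal{D})\) that fit maximizes can be computed for a candidate deterministic policy as follows. The helper `ipw_value` and the toy data are illustrative assumptions, not part of obp:

```python
import numpy as np

def ipw_value(policy_actions, actions, rewards, pscore):
    """Empirical IPW value: mean of 1{pi(x_i) = a_i} * r_i / pscore_i."""
    match = (policy_actions == actions).astype(float)
    return float(np.mean(match * rewards / pscore))

# Toy logged data: a uniform behavior policy over 2 actions (pscore = 0.5).
actions = np.array([0, 1, 0, 1])
rewards = np.array([1.0, 0.0, 1.0, 1.0])
pscore = np.full(4, 0.5)

# A candidate deterministic policy that always plays action 0.
pi0 = np.zeros(4, dtype=int)
print(ipw_value(pi0, actions, rewards, pscore))  # 1.0
```

Maximizing this quantity over a policy class is equivalent to the weighted-classification problem in the last line of the equation, which is why a cost-sensitive classifier trained with sample weights \(r_i / \pi_b(a_i|x_i)\) and labels \(a_i\) can serve as base_classifier.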
predict(context: numpy.ndarray) → numpy.ndarray

Predict best actions for new data.
Note
The action set predicted by this method can contain duplicate items. If you want a non-repetitive action set, please use the sample_action method instead.
- Parameters
context (array-like, shape (n_rounds_of_new_data, dim_context)) – Context vectors for new data.
- Returns
action_dist – Action choices made by a classifier, which can contain duplicate items. If you want a non-repetitive action set, please use the sample_action method.
- Return type
array-like, shape (n_rounds_of_new_data, n_actions, len_list)
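The per-position argmax that produces this one-hot action_dist (and hence the possible duplicates across positions) can be sketched as follows; `argmax_action_dist` is an illustrative helper, not obp's implementation:

```python
import numpy as np

def argmax_action_dist(score):
    """Turn scores of shape (n_rounds, n_actions, len_list) into a one-hot
    action_dist by independently picking, per round and position, the
    highest-scoring action (so duplicates across positions are possible)."""
    n_rounds, n_actions, len_list = score.shape
    best = score.argmax(axis=1)            # shape (n_rounds, len_list)
    action_dist = np.zeros_like(score)
    rounds = np.arange(n_rounds)[:, None]
    positions = np.arange(len_list)[None, :]
    action_dist[rounds, best, positions] = 1.0
    return action_dist

score = np.array([[[1.0, 0.2], [0.3, 0.9]]])  # 1 round, 2 actions, len_list=2
print(argmax_action_dist(score))  # one-hot over actions at each position
```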
predict_proba(context: numpy.ndarray, tau: Union[int, float] = 1.0) → numpy.ndarray

Obtains action choice probabilities for new data based on scores predicted by a classifier.
Note
This predict_proba method obtains action choice probabilities for new data \(x \in \mathcal{X}\) by first computing non-negative scores for all possible candidate actions \(a \in \mathcal{A}\) (where \(\mathcal{A}\) is an action set), and using a Plackett-Luce ranking model as follows:
\[P (A = a | x) = \frac{\mathrm{exp}(f(x,a) / \tau)}{\sum_{a^{\prime} \in \mathcal{A}} \mathrm{exp}(f(x,a^{\prime}) / \tau)},\]where \(A\) is a random variable representing an action, and \(\tau\) is a temperature hyperparameter. \(f: \mathcal{X} \times \mathcal{A} \rightarrow \mathbb{R}_{+}\) is a scoring function, which is implemented in the predict_score method.
Note that this method can be used only when `len_list=1`; otherwise, please use the `sample_action` method.
- Parameters
context (array-like, shape (n_rounds_of_new_data, dim_context)) – Context vectors for new data.
tau (int or float, default=1.0) – A temperature parameter, controlling the randomness of the action choice. As \(\tau \rightarrow \infty\), the algorithm will select arms uniformly at random.
- Returns
choice_prob – Action choice probabilities obtained by a trained classifier.
- Return type
array-like, shape (n_rounds_of_new_data, n_actions, len_list)
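A minimal numpy sketch of the temperature-scaled softmax above; the helper name `softmax_probs` is an assumption, not obp API:

```python
import numpy as np

def softmax_probs(scores, tau=1.0):
    """Row-wise P(A = a | x) = exp(f(x,a)/tau) / sum_a' exp(f(x,a')/tau)."""
    z = scores / tau
    z = z - z.max(axis=1, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

scores = np.array([[2.0, 1.0, 0.0]])  # f(x, a) for one context and 3 actions
print(softmax_probs(scores, tau=1.0))    # peaked on action 0
print(softmax_probs(scores, tau=100.0))  # nearly uniform: large tau flattens
```

As the comments indicate, raising tau flattens the distribution toward the uniform choice described in the tau parameter docs.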
predict_score(context: numpy.ndarray) → numpy.ndarray

Predict non-negative scores for all possible pairs of action and position.
- Parameters
context (array-like, shape (n_rounds_of_new_data, dim_context)) – Context vectors for new data.
- Returns
score_predicted – Scores for all possible pairs of action and position predicted by a classifier.
- Return type
array-like, shape (n_rounds_of_new_data, n_actions, len_list)
sample_action(context: numpy.ndarray, tau: Union[int, float] = 1.0, random_state: Optional[int] = None) → numpy.ndarray

Sample (non-repetitive) actions based on scores predicted by a classifier.
Note
This sample_action method samples a non-repetitive set of actions for new data \(x \in \mathcal{X}\) by first computing non-negative scores for all possible candidate pairs of action and position \((a, k) \in \mathcal{A} \times \mathcal{K}\) (where \(\mathcal{A}\) is an action set and \(\mathcal{K}\) is a position set), and then applying a softmax function sequentially over the remaining actions as follows:
\[\begin{split}& P (A_1 = a_1 | x) = \frac{\mathrm{exp}(f(x,a_1,1) / \tau)}{\sum_{a^{\prime} \in \mathcal{A}} \mathrm{exp}( f(x,a^{\prime},1) / \tau)} , \\ & P (A_2 = a_2 | A_1 = a_1, x) = \frac{\mathrm{exp}(f(x,a_2,2) / \tau)}{\sum_{a^{\prime} \in \mathcal{A} \backslash \{a_1\}} \mathrm{exp}(f(x,a^{\prime},2) / \tau )} , \ldots\end{split}\]where \(A_k\) is a random variable representing an action at a position \(k\), and \(\tau\) is a temperature hyperparameter. \(f: \mathcal{X} \times \mathcal{A} \times \mathcal{K} \rightarrow \mathbb{R}_{+}\) is a scoring function, which is implemented in the predict_score method.
- Parameters
context (array-like, shape (n_rounds_of_new_data, dim_context)) – Context vectors for new data.
tau (int or float, default=1.0) – A temperature parameter, controlling the randomness of the action choice. As \(\tau \rightarrow \infty\), the algorithm will select arms uniformly at random.
random_state (int, default=None) – Controls the random seed in sampling actions.
- Returns
action – Actions sampled by a trained classifier.
- Return type
array-like, shape (n_rounds_of_new_data, n_actions, len_list)
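The sequential sampling scheme above can be sketched in numpy as follows; `sample_ranking` and the toy scores are illustrative assumptions, not obp API. At each position the drawn action is removed from the candidate set, which is what guarantees a non-repetitive list:

```python
import numpy as np

def sample_ranking(score, tau=1.0, random_state=None):
    """Sample a non-repetitive action list: at each position k, draw from a
    softmax over the remaining actions using the position-k scores, then
    remove the drawn action. `score` has shape (n_actions, len_list) with
    score[a, k] = f(x, a, k) for a single context x."""
    rng = np.random.default_rng(random_state)
    n_actions, len_list = score.shape
    remaining = list(range(n_actions))
    ranking = []
    for k in range(len_list):
        z = score[remaining, k] / tau
        p = np.exp(z - z.max())          # stabilized softmax numerator
        p /= p.sum()
        a = int(rng.choice(remaining, p=p))
        ranking.append(a)
        remaining.remove(a)
    return ranking

score = np.array([[3.0, 0.1], [0.5, 2.0], [0.2, 0.3]])  # 3 actions, len_list=2
print(sample_ranking(score, tau=1.0, random_state=0))
```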
property policy_type

Type of the bandit policy.