obp.ope.regression_model
Regression Model Class for Estimating Mean Reward Functions.
Classes

RegressionModel – Machine learning model to estimate the mean reward function (\(q(x,a):= \mathbb{E}[r|x,a]\)).
class obp.ope.regression_model.RegressionModel(base_model: sklearn.base.BaseEstimator, n_actions: int, len_list: int = 1, action_context: Optional[numpy.ndarray] = None, fitting_method: str = 'normal')

Bases: sklearn.base.BaseEstimator

Machine learning model to estimate the mean reward function (\(q(x,a):= \mathbb{E}[r|x,a]\)).
Note
Reward (or outcome) \(r\) must be either binary or continuous.
- Parameters
base_model (BaseEstimator) – A machine learning model used to estimate the mean reward function.
n_actions (int) – Number of actions.
len_list (int, default=1) – Length of the list of actions recommended in each impression. When the Open Bandit Dataset is used, this should be set to 3.
action_context (array-like, shape (n_actions, dim_action_context), default=None) – Context vectors characterizing each action, i.e., a vector representation of each action. If not given, one-hot encoding of the action variable is used automatically.
fitting_method (str, default='normal') – Method used to fit the regression model. Must be one of 'normal', 'iw', or 'mrdr', where 'iw' stands for importance weighting and 'mrdr' for more robust doubly robust.
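As a minimal sketch (plain Python, not part of obp) of the default behavior described above: when action_context is None, each action is represented by a one-hot vector of length n_actions. The helper name below is hypothetical, chosen for illustration only.

```python
def one_hot_action_context(n_actions):
    """Build the default one-hot action representation, shape (n_actions, n_actions)."""
    # Row i is the one-hot vector for action i.
    return [[1.0 if j == i else 0.0 for j in range(n_actions)]
            for i in range(n_actions)]

# With 3 actions, action 1 is represented as [0.0, 1.0, 0.0].
action_context = one_hot_action_context(3)
```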
References
Mehrdad Farajtabar, Yinlam Chow, and Mohammad Ghavamzadeh. “More Robust Doubly Robust Off-policy Evaluation.”, 2018.
Yi Su, Maria Dimakopoulou, Akshay Krishnamurthy, and Miroslav Dudik. “Doubly Robust Off-Policy Evaluation with Shrinkage.”, 2020.
Yusuke Narita, Shota Yasui, and Kohei Yata. “Off-policy Bandit and Reinforcement Learning.”, 2020.
fit(context: numpy.ndarray, action: numpy.ndarray, reward: numpy.ndarray, pscore: Optional[numpy.ndarray] = None, position: Optional[numpy.ndarray] = None, action_dist: Optional[numpy.ndarray] = None) → None

Fit the regression model on given logged bandit feedback data.
- Parameters
context (array-like, shape (n_rounds, dim_context)) – Context vectors in each round, i.e., \(x_t\).
action (array-like, shape (n_rounds,)) – Action sampled by a behavior policy in each round of the logged bandit feedback, i.e., \(a_t\).
reward (array-like, shape (n_rounds,)) – Observed rewards (or outcomes) in each round, i.e., \(r_t\).
pscore (array-like, shape (n_rounds,), default=None) – Action choice probabilities (propensity scores) of the behavior policy in the training logged bandit feedback. When None is given, the behavior policy is assumed to be uniform random.
position (array-like, shape (n_rounds,), default=None) – Position of each round in the given logged bandit feedback. If None, the regression model assumes that there is only one position. When len_list > 1, position must be given.
action_dist (array-like, shape (n_rounds, n_actions, len_list), default=None) – Action choice probabilities of the evaluation policy (can be deterministic), i.e., \(\pi_e(a_t|x_t)\). When fitting_method is 'iw' or 'mrdr', action_dist must be given.
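An illustrative stdlib sketch of the reweighting idea behind the 'iw' fitting method: each logged round is weighted by \(\pi_e(a_t|x_t) / \pi_b(a_t|x_t)\), where the denominator is the pscore. The function below is hypothetical and not the obp implementation.

```python
def importance_weights(eval_probs, pscore):
    """Per-round weights pi_e(a_t|x_t) / pscore_t, usable as sample weights when fitting.

    eval_probs: evaluation-policy probability of the chosen action in each round.
    pscore: behavior-policy probability of that same action (propensity score).
    """
    return [pe / pb for pe, pb in zip(eval_probs, pscore)]

# Rounds where the evaluation policy favors the logged action more than the
# behavior policy did receive weight > 1.
weights = importance_weights([0.5, 0.2], [0.25, 0.4])  # -> [2.0, 0.5]
```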
fit_predict(context: numpy.ndarray, action: numpy.ndarray, reward: numpy.ndarray, pscore: Optional[numpy.ndarray] = None, position: Optional[numpy.ndarray] = None, action_dist: Optional[numpy.ndarray] = None, n_folds: int = 1, random_state: Optional[int] = None) → numpy.ndarray

Fit the regression model on given logged bandit feedback data and predict the mean reward function of the same data.
Note
When n_folds is larger than 1, the cross-fitting procedure is applied. See the references for details on cross-fitting.
- Parameters
context (array-like, shape (n_rounds, dim_context)) – Context vectors in each round, i.e., \(x_t\).
action (array-like, shape (n_rounds,)) – Action sampled by a behavior policy in each round of the logged bandit feedback, i.e., \(a_t\).
reward (array-like, shape (n_rounds,)) – Observed rewards (or outcomes) in each round, i.e., \(r_t\).
pscore (array-like, shape (n_rounds,), default=None) – Action choice probabilities (propensity scores) of the behavior policy in the training logged bandit feedback. When None is given, the behavior policy is assumed to be uniform random.
position (array-like, shape (n_rounds,), default=None) – Position of each round in the given logged bandit feedback. If None, the regression model assumes that there is only one position. When len_list > 1, position must be given.
action_dist (array-like, shape (n_rounds, n_actions, len_list), default=None) – Action choice probabilities of the evaluation policy (can be deterministic), i.e., \(\pi_e(a_t|x_t)\). When fitting_method is 'iw' or 'mrdr', action_dist must be given.
n_folds (int, default=1) – Number of folds in the cross-fitting procedure. When 1 is given, the regression model is trained on the whole logged bandit feedback data.
random_state (int, default=None) – random_state affects the ordering of the indices, which controls the randomness of each fold. See https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.KFold.html for the details.
- Returns
estimated_rewards_by_reg_model – Estimated expected rewards for new data by the regression model.
- Return type
array-like, shape (n_rounds, n_actions, len_list)
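A stdlib sketch of the cross-fitting procedure noted above: with n_folds > 1, the round indices are split into folds, and the reward estimate for each fold comes from a model fit on the remaining folds, so no round is scored by a model that saw it during training. The helper is hypothetical and ignores shuffling via random_state.

```python
def cross_fit_folds(n_rounds, n_folds):
    """Return (train_indices, test_indices) pairs; each round appears in exactly one test fold."""
    idx = list(range(n_rounds))
    fold_size = n_rounds // n_folds
    pairs = []
    for k in range(n_folds):
        start = k * fold_size
        # The last fold absorbs any remainder rounds.
        end = n_rounds if k == n_folds - 1 else start + fold_size
        test = idx[start:end]
        train = idx[:start] + idx[end:]
        pairs.append((train, test))
    return pairs

# With 6 rounds and 3 folds, each model is fit on 4 rounds and scores the other 2.
pairs = cross_fit_folds(6, 3)
```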
get_params(deep=True)

Get parameters for this estimator.
- Parameters
deep (bool, default=True) – If True, will return the parameters for this estimator and contained subobjects that are estimators.
- Returns
params – Parameter names mapped to their values.
- Return type
mapping of string to any
predict(context: numpy.ndarray) → numpy.ndarray

Predict the mean reward function.
- Parameters
context (array-like, shape (n_rounds_of_new_data, dim_context)) – Context vectors for new data.
- Returns
estimated_rewards_by_reg_model – Estimated expected rewards for new data by the regression model.
- Return type
array-like, shape (n_rounds_of_new_data, n_actions, len_list)
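A plain-Python sketch of the output layout documented above: for each new round, the model scores every (action, position) pair, yielding a nested structure of shape (n_rounds_of_new_data, n_actions, len_list). The scoring function here is a stand-in for the fitted base_model, not obp API.

```python
def predict_rewards(contexts, n_actions, len_list, score):
    """score(x, a, p) -> estimated reward; returns nested lists in the documented shape."""
    return [
        [[score(x, a, p) for p in range(len_list)]  # one entry per position
         for a in range(n_actions)]                 # one row per action
        for x in contexts                           # one block per round
    ]

# Two new rounds, two actions, a single position, and a toy scoring rule.
preds = predict_rewards([0.0, 1.0], n_actions=2, len_list=1,
                        score=lambda x, a, p: x + a)
```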
set_params(**params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter>, so that it's possible to update each component of a nested object.

- Parameters
**params (dict) – Estimator parameters.
- Returns
self – Estimator instance.
- Return type
object