obp.ope.regression_model
Regression Model Class for Estimating Mean Reward Functions.
Classes

RegressionModel – Machine learning model to estimate the mean reward function (\(q(x,a):= \mathbb{E}[r|x,a]\)).
class obp.ope.regression_model.RegressionModel(base_model: sklearn.base.BaseEstimator, n_actions: int, len_list: int = 1, action_context: Optional[numpy.ndarray] = None, fitting_method: str = 'normal')

Bases: sklearn.base.BaseEstimator

Machine learning model to estimate the mean reward function (\(q(x,a):= \mathbb{E}[r|x,a]\)).
Note
Reward (or outcome) \(r\) must be either binary or continuous.
- Parameters
base_model (BaseEstimator) – A machine learning model used to estimate the mean reward function.
n_actions (int) – Number of actions.
len_list (int, default=1) – Length of the list of actions recommended in each impression. When the Open Bandit Dataset is used, this should be set to 3.
action_context (array-like, shape (n_actions, dim_action_context), default=None) – Context vectors characterizing each action, i.e., a vector representation of each action. If not given, one-hot encoding of the action variable is used automatically.
fitting_method (str, default='normal') – Method used to fit the regression model. Must be one of 'normal', 'iw', or 'mrdr', where 'iw' stands for importance weighting and 'mrdr' for more robust doubly robust.
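As a minimal sketch (plain Python, not part of obp) of the default behavior described above: when action_context is None, each action is represented by a one-hot vector of length n_actions. The helper name below is hypothetical, chosen for illustration only.

```python
def one_hot_action_context(n_actions):
    """Build the default one-hot action representation, shape (n_actions, n_actions)."""
    # Row i is the one-hot vector for action i.
    return [[1.0 if j == i else 0.0 for j in range(n_actions)]
            for i in range(n_actions)]

# With 3 actions, action 1 is represented as [0.0, 1.0, 0.0].
action_context = one_hot_action_context(3)
```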
References
Mehrdad Farajtabar, Yinlam Chow, and Mohammad Ghavamzadeh. “More Robust Doubly Robust Off-policy Evaluation.”, 2018.
Yi Su, Maria Dimakopoulou, Akshay Krishnamurthy, and Miroslav Dudik. “Doubly Robust Off-Policy Evaluation with Shrinkage.”, 2020.
Yusuke Narita, Shota Yasui, and Kohei Yata. “Off-policy Bandit and Reinforcement Learning.”, 2020.
fit(context: numpy.ndarray, action: numpy.ndarray, reward: numpy.ndarray, pscore: Optional[numpy.ndarray] = None, position: Optional[numpy.ndarray] = None, action_dist: Optional[numpy.ndarray] = None) → None

Fit the regression model on given logged bandit feedback data.
- Parameters
context (array-like, shape (n_rounds, dim_context)) – Context vectors in each round, i.e., \(x_t\).
action (array-like, shape (n_rounds,)) – Action sampled by a behavior policy in each round of the logged bandit feedback, i.e., \(a_t\).
reward (array-like, shape (n_rounds,)) – Observed rewards (or outcomes) in each round, i.e., \(r_t\).
pscore (array-like, shape (n_rounds,), default=None) – Action choice probabilities (propensity scores) of the behavior policy in the training logged bandit feedback. When None is given, the behavior policy is assumed to be uniform random.
position (array-like, shape (n_rounds,), default=None) – Position of each round in the given logged bandit feedback. If None, the regression model assumes that there is only one position. When len_list > 1, position must be given.
action_dist (array-like, shape (n_rounds, n_actions, len_list), default=None) – Action choice probabilities of the evaluation policy (can be deterministic), i.e., \(\pi_e(a_t|x_t)\). When fitting_method is 'iw' or 'mrdr', action_dist must be given.
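An illustrative stdlib sketch of the reweighting idea behind the 'iw' fitting method: each logged round is weighted by \(\pi_e(a_t|x_t) / \pi_b(a_t|x_t)\), where the denominator is the pscore. The function below is hypothetical and not the obp implementation.

```python
def importance_weights(eval_probs, pscore):
    """Per-round weights pi_e(a_t|x_t) / pscore_t, usable as sample weights when fitting.

    eval_probs: evaluation-policy probability of the chosen action in each round.
    pscore: behavior-policy probability of that same action (propensity score).
    """
    return [pe / pb for pe, pb in zip(eval_probs, pscore)]

# Rounds where the evaluation policy favors the logged action more than the
# behavior policy did receive weight > 1.
weights = importance_weights([0.5, 0.2], [0.25, 0.4])  # -> [2.0, 0.5]
```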
fit_predict(context: numpy.ndarray, action: numpy.ndarray, reward: numpy.ndarray, pscore: Optional[numpy.ndarray] = None, position: Optional[numpy.ndarray] = None, action_dist: Optional[numpy.ndarray] = None, n_folds: int = 1, random_state: Optional[int] = None) → numpy.ndarray

Fit the regression model on given logged bandit feedback data and predict the mean reward function of the same data.
Note
When n_folds is larger than 1, the cross-fitting procedure is applied. See the references for details on cross-fitting.
- Parameters
context (array-like, shape (n_rounds, dim_context)) – Context vectors in each round, i.e., \(x_t\).
action (array-like, shape (n_rounds,)) – Action sampled by a behavior policy in each round of the logged bandit feedback, i.e., \(a_t\).
reward (array-like, shape (n_rounds,)) – Observed rewards (or outcomes) in each round, i.e., \(r_t\).
pscore (array-like, shape (n_rounds,), default=None) – Action choice probabilities (propensity scores) of the behavior policy in the training logged bandit feedback. When None is given, the behavior policy is assumed to be uniform random.
position (array-like, shape (n_rounds,), default=None) – Position of each round in the given logged bandit feedback. If None, the regression model assumes that there is only one position. When len_list > 1, position must be given.
action_dist (array-like, shape (n_rounds, n_actions, len_list), default=None) – Action choice probabilities of the evaluation policy (can be deterministic), i.e., \(\pi_e(a_t|x_t)\). When fitting_method is 'iw' or 'mrdr', action_dist must be given.
n_folds (int, default=1) – Number of folds in the cross-fitting procedure. When 1 is given, the regression model is trained on the whole logged bandit feedback data.
random_state (int, default=None) – random_state affects the ordering of the indices, which controls the randomness of each fold. See https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.KFold.html for the details.
- Returns
estimated_rewards_by_reg_model – Estimated expected rewards for new data by the regression model.
- Return type
array-like, shape (n_rounds, n_actions, len_list)
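A stdlib sketch of the cross-fitting procedure noted above: with n_folds > 1, the round indices are split into folds, and the reward estimate for each fold comes from a model fit on the remaining folds, so no round is scored by a model that saw it during training. The helper is hypothetical and ignores shuffling via random_state.

```python
def cross_fit_folds(n_rounds, n_folds):
    """Return (train_indices, test_indices) pairs; each round appears in exactly one test fold."""
    idx = list(range(n_rounds))
    fold_size = n_rounds // n_folds
    pairs = []
    for k in range(n_folds):
        start = k * fold_size
        # The last fold absorbs any remainder rounds.
        end = n_rounds if k == n_folds - 1 else start + fold_size
        test = idx[start:end]
        train = idx[:start] + idx[end:]
        pairs.append((train, test))
    return pairs

# With 6 rounds and 3 folds, each model is fit on 4 rounds and scores the other 2.
pairs = cross_fit_folds(6, 3)
```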
get_params(deep=True)

Get parameters for this estimator.
- Parameters
deep (bool, default=True) – If True, will return the parameters for this estimator and contained subobjects that are estimators.
- Returns
params – Parameter names mapped to their values.
- Return type
mapping of string to any
predict(context: numpy.ndarray) → numpy.ndarray

Predict the mean reward function.
- Parameters
context (array-like, shape (n_rounds_of_new_data, dim_context)) – Context vectors for new data.
- Returns
estimated_rewards_by_reg_model – Estimated expected rewards for new data by the regression model.
- Return type
array-like, shape (n_rounds_of_new_data, n_actions, len_list)
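A plain-Python sketch of the output layout documented above: for each new round, the model scores every (action, position) pair, yielding a nested structure of shape (n_rounds_of_new_data, n_actions, len_list). The scoring function here is a stand-in for the fitted base_model, not obp API.

```python
def predict_rewards(contexts, n_actions, len_list, score):
    """score(x, a, p) -> estimated reward; returns nested lists in the documented shape."""
    return [
        [[score(x, a, p) for p in range(len_list)]  # one entry per position
         for a in range(n_actions)]                 # one row per action
        for x in contexts                           # one block per round
    ]

# Two new rounds, two actions, a single position, and a toy scoring rule.
preds = predict_rewards([0.0, 1.0], n_actions=2, len_list=1,
                        score=lambda x, a, p: x + a)
```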
set_params(**params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter>, so that it's possible to update each component of a nested object.

- Parameters
**params (dict) – Estimator parameters.
- Returns
self – Estimator instance.
- Return type
object