obp.ope.regression_model

Regression Model Class for Estimating Mean Reward Functions.

Classes

RegressionModel(base_model, n_actions, …)

Machine learning model to estimate the mean reward function (\(q(x,a):= \mathbb{E}[r|x,a]\)).

class obp.ope.regression_model.RegressionModel(base_model: sklearn.base.BaseEstimator, n_actions: int, len_list: int = 1, action_context: Optional[numpy.ndarray] = None, fitting_method: str = 'normal')[source]

Bases: sklearn.base.BaseEstimator

Machine learning model to estimate the mean reward function (\(q(x,a):= \mathbb{E}[r|x,a]\)).

Note

Reward (or outcome) \(r\) must be either binary or continuous.

Parameters
  • base_model (BaseEstimator) – A machine learning model used to estimate the mean reward function.

  • n_actions (int) – Number of actions.

  • len_list (int, default=1) – Length of a list of actions recommended in each impression. When Open Bandit Dataset is used, this should be set to 3.

  • action_context (array-like, shape (n_actions, dim_action_context), default=None) – Context vectors characterizing each action, i.e., a vector representation of each action. If not given, one-hot encoding of the action variable is used automatically.

  • fitting_method (str, default=’normal’) – Method to fit the regression model. Must be one of [‘normal’, ‘iw’, ‘mrdr’] where ‘iw’ stands for importance weighting and ‘mrdr’ stands for more robust doubly robust.

References

Mehrdad Farajtabar, Yinlam Chow, and Mohammad Ghavamzadeh. “More Robust Doubly Robust Off-policy Evaluation.”, 2018.

Yi Su, Maria Dimakopoulou, Akshay Krishnamurthy, and Miroslav Dudik. “Doubly Robust Off-Policy Evaluation with Shrinkage.”, 2020.

Yusuke Narita, Shota Yasui, and Kohei Yata. “Off-policy Bandit and Reinforcement Learning.”, 2020.
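Example

A minimal construction sketch (the choice of LogisticRegression and the number of actions below are illustrative assumptions, not requirements of the API):

from sklearn.linear_model import LogisticRegression

from obp.ope import RegressionModel

# any scikit-learn estimator can serve as base_model;
# a classifier such as LogisticRegression suits binary rewards
regression_model = RegressionModel(
    base_model=LogisticRegression(max_iter=1000, random_state=12345),
    n_actions=10,  # illustrative number of actions
    len_list=1,
    fitting_method="normal",
)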

fit(context: numpy.ndarray, action: numpy.ndarray, reward: numpy.ndarray, pscore: Optional[numpy.ndarray] = None, position: Optional[numpy.ndarray] = None, action_dist: Optional[numpy.ndarray] = None) → None[source]

Fit the regression model on given logged bandit feedback data.

Parameters
  • context (array-like, shape (n_rounds, dim_context)) – Context vectors in each round, i.e., \(x_t\).

  • action (array-like, shape (n_rounds,)) – Action sampled by a behavior policy in each round of the logged bandit feedback, i.e., \(a_t\).

  • reward (array-like, shape (n_rounds,)) – Observed rewards (or outcomes) in each round, i.e., \(r_t\).

  • pscore (array-like, shape (n_rounds,), default=None) – Action choice probabilities (propensity scores) of a behavior policy in the training logged bandit feedback. When None is given, the behavior policy is assumed to be uniform.

  • position (array-like, shape (n_rounds,), default=None) – Position in the recommendation interface where each action was presented in the logged bandit feedback. If None is given, the regression model assumes that there is only one position. When len_list > 1, position must be given.

  • action_dist (array-like, shape (n_rounds, n_actions, len_list), default=None) – Action choice probabilities of the evaluation policy (can be deterministic), i.e., \(\pi_e(a_t|x_t)\). When fitting_method is ‘iw’ or ‘mrdr’, action_dist must be given.
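Example

A hedged usage sketch: the synthetic logged bandit feedback below is generated with obp.dataset.SyntheticBanditDataset, and the dataset parameters and base model are illustrative assumptions.

from sklearn.linear_model import LogisticRegression

from obp.dataset import SyntheticBanditDataset
from obp.ope import RegressionModel

# illustrative synthetic logged bandit feedback with binary rewards
dataset = SyntheticBanditDataset(n_actions=10, dim_context=5, random_state=12345)
bandit_feedback = dataset.obtain_batch_bandit_feedback(n_rounds=10000)

regression_model = RegressionModel(
    base_model=LogisticRegression(max_iter=1000, random_state=12345),
    n_actions=dataset.n_actions,
)
# with the default fitting_method='normal',
# neither pscore nor action_dist is required
regression_model.fit(
    context=bandit_feedback["context"],
    action=bandit_feedback["action"],
    reward=bandit_feedback["reward"],
)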

fit_predict(context: numpy.ndarray, action: numpy.ndarray, reward: numpy.ndarray, pscore: Optional[numpy.ndarray] = None, position: Optional[numpy.ndarray] = None, action_dist: Optional[numpy.ndarray] = None, n_folds: int = 1, random_state: Optional[int] = None) → numpy.ndarray[source]

Fit the regression model on given logged bandit feedback data and predict the mean reward function on the same data.

Note

When n_folds is larger than 1, the cross-fitting procedure is applied. See the references for details of the cross-fitting technique.

Parameters
  • context (array-like, shape (n_rounds, dim_context)) – Context vectors in each round, i.e., \(x_t\).

  • action (array-like, shape (n_rounds,)) – Action sampled by a behavior policy in each round of the logged bandit feedback, i.e., \(a_t\).

  • reward (array-like, shape (n_rounds,)) – Observed rewards (or outcomes) in each round, i.e., \(r_t\).

  • pscore (array-like, shape (n_rounds,), default=None) – Action choice probabilities (propensity scores) of a behavior policy in the training logged bandit feedback. When None is given, the behavior policy is assumed to be uniform.

  • position (array-like, shape (n_rounds,), default=None) – Position in the recommendation interface where each action was presented in the logged bandit feedback. If None is given, the regression model assumes that there is only one position. When len_list > 1, position must be given.

  • action_dist (array-like, shape (n_rounds, n_actions, len_list), default=None) – Action choice probabilities of the evaluation policy (can be deterministic), i.e., \(\pi_e(a_t|x_t)\). When fitting_method is ‘iw’ or ‘mrdr’, action_dist must be given.

  • n_folds (int, default=1) – Number of folds in the cross-fitting procedure. When 1 is given, the regression model is trained on the whole logged bandit feedback data.

  • random_state (int, default=None) – random_state affects the ordering of the indices, which controls the randomness of each fold. See https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.KFold.html for the details.

Returns

estimated_rewards_by_reg_model – Expected rewards of the given logged bandit feedback data estimated by the regression model.

Return type

array-like, shape (n_rounds, n_actions, len_list)
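Example

A sketch of cross-fitting; the synthetic setup mirrors the fit example above and is an illustrative assumption:

from sklearn.linear_model import LogisticRegression

from obp.dataset import SyntheticBanditDataset
from obp.ope import RegressionModel

# illustrative synthetic logged bandit feedback
dataset = SyntheticBanditDataset(n_actions=10, dim_context=5, random_state=12345)
bandit_feedback = dataset.obtain_batch_bandit_feedback(n_rounds=10000)

regression_model = RegressionModel(
    base_model=LogisticRegression(max_iter=1000, random_state=12345),
    n_actions=dataset.n_actions,
)
# n_folds=3 applies cross-fitting: the rewards of each fold are predicted
# by a model trained on the remaining folds
estimated_rewards_by_reg_model = regression_model.fit_predict(
    context=bandit_feedback["context"],
    action=bandit_feedback["action"],
    reward=bandit_feedback["reward"],
    n_folds=3,
    random_state=12345,
)
print(estimated_rewards_by_reg_model.shape)  # (10000, 10, 1)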

get_params(deep=True)

Get parameters for this estimator.

Parameters

deep (bool, default=True) – If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns

params – Parameter names mapped to their values.

Return type

mapping of string to any

predict(context: numpy.ndarray) → numpy.ndarray[source]

Predict the mean reward function.

Parameters

context (array-like, shape (n_rounds_of_new_data, dim_context)) – Context vectors for new data.

Returns

estimated_rewards_by_reg_model – Estimated expected rewards for new data by the regression model.

Return type

array-like, shape (n_rounds_of_new_data, n_actions, len_list)
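Example

A brief sketch, assuming regression_model has already been fitted as in the fit example above (where contexts are 5-dimensional); the new contexts below are randomly drawn for illustration:

import numpy as np

# 100 hypothetical new context vectors with dim_context=5
new_context = np.random.default_rng(12345).normal(size=(100, 5))
estimated_rewards = regression_model.predict(context=new_context)
print(estimated_rewards.shape)  # (100, 10, 1), i.e., (n_rounds_of_new_data, n_actions, len_list)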

set_params(**params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters

**params (dict) – Estimator parameters.

Returns

self – Estimator instance.

Return type

object