obp.ope.estimators

Off-Policy Estimators.

Classes

BaseOffPolicyEstimator()

Base class for OPE estimators.

DirectMethod(estimator_name)

Estimate the policy value by Direct Method (DM).

DoublyRobust(estimator_name)

Estimate the policy value by Doubly Robust (DR).

DoublyRobustWithShrinkage(estimator_name, …)

Estimate the policy value by Doubly Robust with optimistic shrinkage (DRos).

InverseProbabilityWeighting(estimator_name)

Estimate the policy value by Inverse Probability Weighting (IPW).

ReplayMethod(estimator_name)

Estimate the policy value by Replay Method (RM).

SelfNormalizedDoublyRobust(estimator_name)

Estimate the policy value by Self-Normalized Doubly Robust (SNDR).

SelfNormalizedInverseProbabilityWeighting(…)

Estimate the policy value by Self-Normalized Inverse Probability Weighting (SNIPW).

SwitchDoublyRobust(estimator_name, tau)

Estimate the policy value by Switch Doubly Robust (Switch-DR).

SwitchInverseProbabilityWeighting(…)

Estimate the policy value by Switch Inverse Probability Weighting (Switch-IPW).

class obp.ope.estimators.BaseOffPolicyEstimator[source]

Bases: object

Base class for OPE estimators.

abstract estimate_interval() → Dict[str, float][source]

Estimate confidence interval of policy value by nonparametric bootstrap procedure.

abstract estimate_policy_value() → float[source]

Estimate policy value of an evaluation policy.

class obp.ope.estimators.DirectMethod(estimator_name: str = 'dm')[source]

Bases: obp.ope.estimators.BaseOffPolicyEstimator

Estimate the policy value by Direct Method (DM).

Note

DM first learns a supervised machine learning model, such as ridge regression or gradient boosting, to estimate the mean reward function (\(q(x,a) = \mathbb{E}[r|x,a]\)), and then uses the estimated rewards to estimate the policy value as follows.

\[\begin{split}\hat{V}_{\mathrm{DM}} (\pi_e; \mathcal{D}, \hat{q}) &:= \mathbb{E}_{\mathcal{D}} \left[ \sum_{a \in \mathcal{A}} \hat{q} (x_t,a) \pi_e(a|x_t) \right], \\ & = \mathbb{E}_{\mathcal{D}}[\hat{q} (x_t,\pi_e)],\end{split}\]

where \(\mathcal{D}=\{(x_t,a_t,r_t)\}_{t=1}^{T}\) is logged bandit feedback data with \(T\) rounds collected by a behavior policy \(\pi_b\). \(\mathbb{E}_{\mathcal{D}}[\cdot]\) is the empirical average over \(T\) observations in \(\mathcal{D}\). \(\hat{q} (x,a)\) is an estimated expected reward given \(x\) and \(a\). \(\hat{q} (x_t,\pi):= \mathbb{E}_{a \sim \pi(a|x)}[\hat{q}(x,a)]\) is the expectation of the estimated reward function over \(\pi\). To estimate the mean reward function, please use obp.ope.regression_model.RegressionModel, which supports several fitting methods specific to OPE.

If the regression model (\(\hat{q}\)) is a good approximation to the true mean reward function, this estimator accurately estimates the policy value of the evaluation policy. If the regression function fails to approximate the mean reward function well, however, the final estimator is no longer consistent.
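The following is a minimal NumPy sketch of the DM formula above, using hypothetical synthetic arrays with the documented shapes and assuming len_list = 1:

import numpy as np

# Hypothetical synthetic inputs with the documented shapes (n_rounds, n_actions, len_list).
n_rounds, n_actions = 5, 3
rng = np.random.default_rng(0)
estimated_rewards_by_reg_model = rng.uniform(size=(n_rounds, n_actions, 1))  # \hat{q}(x_t, a)
action_dist = np.full((n_rounds, n_actions, 1), 1.0 / n_actions)             # \pi_e(a|x_t)

# \hat{V}_DM = empirical mean over rounds of sum_a \hat{q}(x_t, a) \pi_e(a|x_t)
v_hat_dm = np.average(np.sum(estimated_rewards_by_reg_model * action_dist, axis=(1, 2)))
print(v_hat_dm)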

Parameters

estimator_name (str, default=’dm’.) – Name of off-policy estimator.

References

Alina Beygelzimer and John Langford. “The offset tree for learning with partial labels.”, 2009.

Miroslav Dudík, Dumitru Erhan, John Langford, and Lihong Li. “Doubly Robust Policy Evaluation and Optimization.”, 2014.

estimate_interval(position: numpy.ndarray, action_dist: numpy.ndarray, estimated_rewards_by_reg_model: numpy.ndarray, alpha: float = 0.05, n_bootstrap_samples: int = 10000, random_state: Optional[int] = None, **kwargs) → Dict[str, float][source]

Estimate confidence interval of policy value by nonparametric bootstrap procedure.

Parameters
  • position (array-like, shape (n_rounds,)) – Positions of each round in the given logged bandit feedback.

  • action_dist (array-like, shape (n_rounds, n_actions, len_list)) – Action choice probabilities by the evaluation policy (can be deterministic), i.e., \(\pi_e(a_t|x_t)\).

  • estimated_rewards_by_reg_model (array-like, shape (n_rounds, n_actions, len_list)) – Expected rewards for each round, action, and position estimated by a regression model, i.e., \(\hat{q}(x_t,a_t)\).

  • alpha (float, default=0.05) – Significance level of the confidence interval.

  • n_bootstrap_samples (int, default=10000) – Number of resamplings performed in the bootstrap procedure.

  • random_state (int, default=None) – Controls the random seed in bootstrap sampling.

Returns

estimated_confidence_interval – Dictionary storing the estimated mean and upper-lower confidence bounds.

Return type

Dict[str, float]
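A hedged usage sketch of estimate_interval, based only on the signature documented above; the arrays below are hypothetical placeholders with len_list = 1:

import numpy as np
from obp.ope.estimators import DirectMethod

n_rounds, n_actions = 5, 3
rng = np.random.default_rng(0)
dm = DirectMethod()
estimated_confidence_interval = dm.estimate_interval(
    position=np.zeros(n_rounds, dtype=int),
    action_dist=np.full((n_rounds, n_actions, 1), 1.0 / n_actions),
    estimated_rewards_by_reg_model=rng.uniform(size=(n_rounds, n_actions, 1)),
    alpha=0.05,
    n_bootstrap_samples=1000,
    random_state=12345,
)
print(estimated_confidence_interval)  # dictionary with the estimated mean and lower/upper confidence bounds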

estimate_policy_value(position: numpy.ndarray, action_dist: numpy.ndarray, estimated_rewards_by_reg_model: numpy.ndarray, **kwargs) → float[source]

Estimate policy value of an evaluation policy.

Parameters
  • position (array-like, shape (n_rounds,)) – Positions of each round in the given logged bandit feedback.

  • action_dist (array-like, shape (n_rounds, n_actions, len_list)) – Action choice probabilities by the evaluation policy (can be deterministic), i.e., \(\pi_e(a_t|x_t)\).

  • estimated_rewards_by_reg_model (array-like, shape (n_rounds, n_actions, len_list)) – Expected rewards for each round, action, and position estimated by a regression model, i.e., \(\hat{q}(x_t,a_t)\).

Returns

V_hat – Estimated policy value (performance) of a given evaluation policy.

Return type

float

class obp.ope.estimators.DoublyRobust(estimator_name: str = 'dr')[source]

Bases: obp.ope.estimators.InverseProbabilityWeighting

Estimate the policy value by Doubly Robust (DR).

Note

Similar to DM, DR first learns a supervised machine learning model, such as ridge regression or gradient boosting, to estimate the mean reward function (\(q(x,a) = \mathbb{E}[r|x,a]\)), and then uses it to estimate the policy value as follows.

\[\hat{V}_{\mathrm{DR}} (\pi_e; \mathcal{D}, \hat{q}) := \mathbb{E}_{\mathcal{D}}[\hat{q}(x_t,\pi_e) + w(x_t,a_t) (r_t - \hat{q}(x_t,a_t))],\]

where \(\mathcal{D}=\{(x_t,a_t,r_t)\}_{t=1}^{T}\) is logged bandit feedback data with \(T\) rounds collected by a behavior policy \(\pi_b\). \(w(x,a):=\pi_e (a|x)/\pi_b (a|x)\) is the importance weight given \(x\) and \(a\). \(\mathbb{E}_{\mathcal{D}}[\cdot]\) is the empirical average over \(T\) observations in \(\mathcal{D}\). \(\hat{q} (x,a)\) is an estimated expected reward given \(x\) and \(a\). \(\hat{q} (x_t,\pi):= \mathbb{E}_{a \sim \pi(a|x)}[\hat{q}(x,a)]\) is the expectation of the estimated reward function over \(\pi\).

To estimate the mean reward function, please use obp.ope.regression_model.RegressionModel, which supports several fitting methods specific to OPE, such as more robust doubly robust.

Like IPW, DR uses a weighted version of the rewards, but it also uses the estimated mean reward function (the regression model) as a control variate to decrease the variance. It preserves the consistency of IPW if either the importance weight or the mean reward estimator is accurate (a property called double robustness). Moreover, DR is semiparametric efficient when the mean reward estimator is correctly specified.
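The DR formula can be written as a short NumPy sketch; the logged data below are hypothetical placeholders (len_list = 1, uniform behavior and evaluation policies):

import numpy as np

# Hypothetical logged bandit feedback and regression-model outputs.
n_rounds, n_actions = 5, 3
rng = np.random.default_rng(0)
reward = rng.binomial(1, 0.5, size=n_rounds)                      # r_t
action = rng.integers(n_actions, size=n_rounds)                   # a_t
pscore = np.full(n_rounds, 1.0 / n_actions)                       # \pi_b(a_t|x_t)
action_dist = np.full((n_rounds, n_actions, 1), 1.0 / n_actions)  # \pi_e(a|x_t)
q_hat = rng.uniform(size=(n_rounds, n_actions, 1))                # \hat{q}(x_t, a)

rounds = np.arange(n_rounds)
iw = action_dist[rounds, action, 0] / pscore                      # w(x_t, a_t)
q_hat_factual = q_hat[rounds, action, 0]                          # \hat{q}(x_t, a_t)
q_hat_pi_e = np.sum(q_hat * action_dist, axis=(1, 2))             # \hat{q}(x_t, \pi_e)

# \hat{V}_DR = E_D[ \hat{q}(x_t, \pi_e) + w(x_t, a_t) (r_t - \hat{q}(x_t, a_t)) ]
v_hat_dr = np.average(q_hat_pi_e + iw * (reward - q_hat_factual))
print(v_hat_dr)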

Parameters

estimator_name (str, default=’dr’.) – Name of off-policy estimator.

References

Miroslav Dudík, Dumitru Erhan, John Langford, and Lihong Li. “Doubly Robust Policy Evaluation and Optimization.”, 2014.

Mehrdad Farajtabar, Yinlam Chow, and Mohammad Ghavamzadeh. “More Robust Doubly Robust Off-policy Evaluation.”, 2018.

estimate_interval(reward: numpy.ndarray, action: numpy.ndarray, position: numpy.ndarray, pscore: numpy.ndarray, action_dist: numpy.ndarray, estimated_rewards_by_reg_model: numpy.ndarray, alpha: float = 0.05, n_bootstrap_samples: int = 10000, random_state: Optional[int] = None, **kwargs) → Dict[str, float][source]

Estimate confidence interval of policy value by nonparametric bootstrap procedure.

Parameters
  • reward (array-like, shape (n_rounds,)) – Reward observed in each round of the logged bandit feedback, i.e., \(r_t\).

  • action (array-like, shape (n_rounds,)) – Action sampled by a behavior policy in each round of the logged bandit feedback, i.e., \(a_t\).

  • position (array-like, shape (n_rounds,)) – Positions of each round in the given logged bandit feedback.

  • pscore (array-like, shape (n_rounds,)) – Action choice probabilities by a behavior policy (propensity scores), i.e., \(\pi_b(a_t|x_t)\).

  • action_dist (array-like, shape (n_rounds, n_actions, len_list)) – Action choice probabilities by the evaluation policy (can be deterministic), i.e., \(\pi_e(a_t|x_t)\).

  • estimated_rewards_by_reg_model (array-like, shape (n_rounds, n_actions, len_list)) – Expected rewards for each round, action, and position estimated by a regression model, i.e., \(\hat{q}(x_t,a_t)\).

  • alpha (float, default=0.05) – Significance level of the confidence interval.

  • n_bootstrap_samples (int, default=10000) – Number of resamplings performed in the bootstrap procedure.

  • random_state (int, default=None) – Controls the random seed in bootstrap sampling.

Returns

estimated_confidence_interval – Dictionary storing the estimated mean and upper-lower confidence bounds.

Return type

Dict[str, float]

estimate_policy_value(reward: numpy.ndarray, action: numpy.ndarray, position: numpy.ndarray, pscore: numpy.ndarray, action_dist: numpy.ndarray, estimated_rewards_by_reg_model: numpy.ndarray) → float[source]

Estimate policy value of an evaluation policy.

Parameters
  • reward (array-like, shape (n_rounds,)) – Reward observed in each round of the logged bandit feedback, i.e., \(r_t\).

  • action (array-like, shape (n_rounds,)) – Action sampled by a behavior policy in each round of the logged bandit feedback, i.e., \(a_t\).

  • position (array-like, shape (n_rounds,)) – Positions of each round in the given logged bandit feedback.

  • pscore (array-like, shape (n_rounds,)) – Action choice probabilities by a behavior policy (propensity scores), i.e., \(\pi_b(a_t|x_t)\).

  • action_dist (array-like, shape (n_rounds, n_actions, len_list)) – Action choice probabilities by the evaluation policy (can be deterministic), i.e., \(\pi_e(a_t|x_t)\).

  • estimated_rewards_by_reg_model (array-like, shape (n_rounds, n_actions, len_list)) – Expected rewards for each round, action, and position estimated by a regression model, i.e., \(\hat{q}(x_t,a_t)\).

Returns

V_hat – Estimated policy value by the DR estimator.

Return type

float

class obp.ope.estimators.DoublyRobustWithShrinkage(estimator_name: str = 'dr-os', lambda_: float = 0.0)[source]

Bases: obp.ope.estimators.DoublyRobust

Estimate the policy value by Doubly Robust with optimistic shrinkage (DRos).

Note

DR with (optimistic) shrinkage replaces the importance weight in the original DR estimator with a new weight mapping found by directly optimizing sharp bounds on the resulting MSE.

\[\hat{V}_{\mathrm{DRos}} (\pi_e; \mathcal{D}, \hat{q}, \lambda) := \mathbb{E}_{\mathcal{D}} [\hat{q}(x_t,\pi_e) + w_o(x_t,a_t;\lambda) (r_t - \hat{q}(x_t,a_t))],\]

where \(\mathcal{D}=\{(x_t,a_t,r_t)\}_{t=1}^{T}\) is logged bandit feedback data with \(T\) rounds collected by a behavior policy \(\pi_b\). \(w(x,a):=\pi_e (a|x)/\pi_b (a|x)\) is the importance weight given \(x\) and \(a\). \(\hat{q} (x_t,\pi):= \mathbb{E}_{a \sim \pi(a|x)}[\hat{q}(x,a)]\) is the expectation of the estimated reward function over \(\pi\). \(\mathbb{E}_{\mathcal{D}}[\cdot]\) is the empirical average over \(T\) observations in \(\mathcal{D}\). \(\hat{q} (x,a)\) is an estimated expected reward given \(x\) and \(a\). To estimate the mean reward function, please use obp.ope.regression_model.RegressionModel.

\(w_{o} (x_t,a_t;\lambda)\) is the shrunk importance weight introduced by the shrinkage technique, defined as

\[w_{o} (x_t,a_t;\lambda) := \frac{\lambda}{w^2(x_t,a_t) + \lambda} w(x_t,a_t).\]

When \(\lambda=0\), we have \(w_{o} (x,a;\lambda)=0\), so DRos reduces to the DM estimator. In contrast, as \(\lambda \rightarrow \infty\), \(w_{o} (x,a;\lambda)\) increases and in the limit becomes the original importance weight, so DRos reduces to the standard DR estimator.
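How \(\lambda\) interpolates between DM and DR can be seen from the shrunk weights alone; the weight values below are hypothetical:

import numpy as np

iw = np.array([0.2, 1.0, 5.0, 20.0])  # hypothetical importance weights w(x_t, a_t)
for lambda_ in [0.0, 1.0, 100.0, 1e6]:
    iw_shrunk = (lambda_ / (iw ** 2 + lambda_)) * iw
    print(lambda_, iw_shrunk)
# lambda_ = 0 zeroes every weight (DM behavior); a very large lambda_ leaves the
# original weights almost unchanged (standard DR behavior).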

Parameters
  • lambda_ (float) – Shrinkage hyperparameter. This hyperparameter should be larger than or equal to zero; otherwise it is meaningless.

  • estimator_name (str, default=’dr-os’.) – Name of off-policy estimator.

References

Miroslav Dudík, Dumitru Erhan, John Langford, and Lihong Li. “Doubly Robust Policy Evaluation and Optimization.”, 2014.

Yi Su, Maria Dimakopoulou, Akshay Krishnamurthy, and Miroslav Dudik. “Doubly Robust Off-Policy Evaluation with Shrinkage.”, 2020.

estimate_interval(reward: numpy.ndarray, action: numpy.ndarray, position: numpy.ndarray, pscore: numpy.ndarray, action_dist: numpy.ndarray, estimated_rewards_by_reg_model: numpy.ndarray, alpha: float = 0.05, n_bootstrap_samples: int = 10000, random_state: Optional[int] = None, **kwargs) → Dict[str, float]

Estimate confidence interval of policy value by nonparametric bootstrap procedure.

Parameters
  • reward (array-like, shape (n_rounds,)) – Reward observed in each round of the logged bandit feedback, i.e., \(r_t\).

  • action (array-like, shape (n_rounds,)) – Action sampled by a behavior policy in each round of the logged bandit feedback, i.e., \(a_t\).

  • position (array-like, shape (n_rounds,)) – Positions of each round in the given logged bandit feedback.

  • pscore (array-like, shape (n_rounds,)) – Action choice probabilities by a behavior policy (propensity scores), i.e., \(\pi_b(a_t|x_t)\).

  • action_dist (array-like, shape (n_rounds, n_actions, len_list)) – Action choice probabilities by the evaluation policy (can be deterministic), i.e., \(\pi_e(a_t|x_t)\).

  • estimated_rewards_by_reg_model (array-like, shape (n_rounds, n_actions, len_list)) – Expected rewards for each round, action, and position estimated by a regression model, i.e., \(\hat{q}(x_t,a_t)\).

  • alpha (float, default=0.05) – Significance level of the confidence interval.

  • n_bootstrap_samples (int, default=10000) – Number of resamplings performed in the bootstrap procedure.

  • random_state (int, default=None) – Controls the random seed in bootstrap sampling.

Returns

estimated_confidence_interval – Dictionary storing the estimated mean and upper-lower confidence bounds.

Return type

Dict[str, float]

estimate_policy_value(reward: numpy.ndarray, action: numpy.ndarray, position: numpy.ndarray, pscore: numpy.ndarray, action_dist: numpy.ndarray, estimated_rewards_by_reg_model: numpy.ndarray) → float

Estimate policy value of an evaluation policy.

Parameters
  • reward (array-like, shape (n_rounds,)) – Reward observed in each round of the logged bandit feedback, i.e., \(r_t\).

  • action (array-like, shape (n_rounds,)) – Action sampled by a behavior policy in each round of the logged bandit feedback, i.e., \(a_t\).

  • position (array-like, shape (n_rounds,)) – Positions of each round in the given logged bandit feedback.

  • pscore (array-like, shape (n_rounds,)) – Action choice probabilities by a behavior policy (propensity scores), i.e., \(\pi_b(a_t|x_t)\).

  • action_dist (array-like, shape (n_rounds, n_actions, len_list)) – Action choice probabilities by the evaluation policy (can be deterministic), i.e., \(\pi_e(a_t|x_t)\).

  • estimated_rewards_by_reg_model (array-like, shape (n_rounds, n_actions, len_list)) – Expected rewards for each round, action, and position estimated by a regression model, i.e., \(\hat{q}(x_t,a_t)\).

Returns

V_hat – Estimated policy value by the DR estimator.

Return type

float

class obp.ope.estimators.InverseProbabilityWeighting(estimator_name: str = 'ipw')[source]

Bases: obp.ope.estimators.BaseOffPolicyEstimator

Estimate the policy value by Inverse Probability Weighting (IPW).

Note

Inverse Probability Weighting (IPW) estimates the policy value of a given evaluation policy \(\pi_e\) by

\[\hat{V}_{\mathrm{IPW}} (\pi_e; \mathcal{D}) := \mathbb{E}_{\mathcal{D}} [ w(x_t,a_t) r_t],\]

where \(\mathcal{D}=\{(x_t,a_t,r_t)\}_{t=1}^{T}\) is logged bandit feedback data with \(T\) rounds collected by a behavior policy \(\pi_b\). \(w(x,a):=\pi_e (a|x)/\pi_b (a|x)\) is the importance weight given \(x\) and \(a\). \(\mathbb{E}_{\mathcal{D}}[\cdot]\) is the empirical average over \(T\) observations in \(\mathcal{D}\).

IPW re-weights the rewards by the ratio of the evaluation policy and behavior policy (importance weight). When the behavior policy is known, IPW is unbiased and consistent for the true policy value. However, it can have a large variance, especially when the evaluation policy significantly deviates from the behavior policy.
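The IPW formula reduces to a one-line weighted average; the logged data below are hypothetical placeholders (len_list = 1):

import numpy as np

n_rounds, n_actions = 5, 3
rng = np.random.default_rng(0)
reward = rng.binomial(1, 0.5, size=n_rounds)                      # r_t
action = rng.integers(n_actions, size=n_rounds)                   # a_t
pscore = np.full(n_rounds, 1.0 / n_actions)                       # \pi_b(a_t|x_t)
action_dist = np.full((n_rounds, n_actions, 1), 1.0 / n_actions)  # \pi_e(a|x_t)

iw = action_dist[np.arange(n_rounds), action, 0] / pscore         # w(x_t, a_t)
# \hat{V}_IPW = E_D[ w(x_t, a_t) r_t ]
v_hat_ipw = np.average(iw * reward)
print(v_hat_ipw)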

Parameters

estimator_name (str, default=’ipw’.) – Name of off-policy estimator.

References

Alex Strehl, John Langford, Lihong Li, and Sham M Kakade. “Learning from Logged Implicit Exploration Data.”, 2010.

Miroslav Dudík, Dumitru Erhan, John Langford, and Lihong Li. “Doubly Robust Policy Evaluation and Optimization.”, 2014.

estimate_interval(reward: numpy.ndarray, action: numpy.ndarray, position: numpy.ndarray, pscore: numpy.ndarray, action_dist: numpy.ndarray, alpha: float = 0.05, n_bootstrap_samples: int = 10000, random_state: Optional[int] = None, **kwargs) → Dict[str, float][source]

Estimate confidence interval of policy value by nonparametric bootstrap procedure.

Parameters
  • reward (array-like, shape (n_rounds,)) – Reward observed in each round of the logged bandit feedback, i.e., \(r_t\).

  • action (array-like, shape (n_rounds,)) – Action sampled by a behavior policy in each round of the logged bandit feedback, i.e., \(a_t\).

  • position (array-like, shape (n_rounds,)) – Positions of each round in the given logged bandit feedback.

  • pscore (array-like, shape (n_rounds,)) – Action choice probabilities by a behavior policy (propensity scores), i.e., \(\pi_b(a_t|x_t)\).

  • action_dist (array-like, shape (n_rounds, n_actions, len_list)) – Action choice probabilities by the evaluation policy (can be deterministic), i.e., \(\pi_e(a_t|x_t)\).

  • alpha (float, default=0.05) – Significance level of the confidence interval.

  • n_bootstrap_samples (int, default=10000) – Number of resamplings performed in the bootstrap procedure.

  • random_state (int, default=None) – Controls the random seed in bootstrap sampling.

Returns

estimated_confidence_interval – Dictionary storing the estimated mean and upper-lower confidence bounds.

Return type

Dict[str, float]

estimate_policy_value(reward: numpy.ndarray, action: numpy.ndarray, position: numpy.ndarray, pscore: numpy.ndarray, action_dist: numpy.ndarray, **kwargs) → numpy.ndarray[source]

Estimate policy value of an evaluation policy.

Parameters
  • reward (array-like, shape (n_rounds,)) – Reward observed in each round of the logged bandit feedback, i.e., \(r_t\).

  • action (array-like, shape (n_rounds,)) – Action sampled by a behavior policy in each round of the logged bandit feedback, i.e., \(a_t\).

  • position (array-like, shape (n_rounds,)) – Positions of each round in the given logged bandit feedback.

  • pscore (array-like, shape (n_rounds,)) – Action choice probabilities by a behavior policy (propensity scores), i.e., \(\pi_b(a_t|x_t)\).

  • action_dist (array-like, shape (n_rounds, n_actions, len_list)) – Action choice probabilities by the evaluation policy (can be deterministic), i.e., \(\pi_e(a_t|x_t)\).

Returns

V_hat – Estimated policy value (performance) of a given evaluation policy.

Return type

float

class obp.ope.estimators.ReplayMethod(estimator_name: str = 'rm')[source]

Bases: obp.ope.estimators.BaseOffPolicyEstimator

Estimate the policy value by Replay Method (RM).

Note

Replay Method (RM) estimates the policy value of a given evaluation policy \(\pi_e\) by

\[\hat{V}_{\mathrm{RM}} (\pi_e; \mathcal{D}) := \frac{\mathbb{E}_{\mathcal{D}}[\mathbb{I} \{ \pi_e (x_t) = a_t \} r_t ]}{\mathbb{E}_{\mathcal{D}}[\mathbb{I} \{ \pi_e (x_t) = a_t \}]},\]

where \(\mathcal{D}=\{(x_t,a_t,r_t)\}_{t=1}^{T}\) is logged bandit feedback data with \(T\) rounds collected by a behavior policy \(\pi_b\). \(\pi_e: \mathcal{X} \rightarrow \mathcal{A}\) is the function representing action choices by the evaluation policy realized during offline bandit simulation. \(\mathbb{E}_{\mathcal{D}}[\cdot]\) is the empirical average over \(T\) observations in \(\mathcal{D}\).
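A minimal NumPy sketch of the Replay Method, with hypothetical logged data and a hypothetical deterministic evaluation policy:

import numpy as np

n_rounds, n_actions = 8, 3
rng = np.random.default_rng(0)
reward = rng.binomial(1, 0.5, size=n_rounds)           # r_t
action = rng.integers(n_actions, size=n_rounds)        # a_t chosen by the behavior policy
pi_e_action = rng.integers(n_actions, size=n_rounds)   # \pi_e(x_t), the evaluation policy's choice

match = pi_e_action == action                          # I{ \pi_e(x_t) = a_t }
# \hat{V}_RM = E_D[ I{\pi_e(x_t)=a_t} r_t ] / E_D[ I{\pi_e(x_t)=a_t} ]
v_hat_rm = reward[match].mean() if match.any() else 0.0
print(v_hat_rm)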

Parameters

estimator_name (str, default=’rm’.) – Name of off-policy estimator.

References

Lihong Li, Wei Chu, John Langford, and Xuanhui Wang. “Unbiased Offline Evaluation of Contextual-bandit-based News Article Recommendation Algorithms.”, 2011.

estimate_interval(reward: numpy.ndarray, action: numpy.ndarray, position: numpy.ndarray, action_dist: numpy.ndarray, alpha: float = 0.05, n_bootstrap_samples: int = 10000, random_state: Optional[int] = None, **kwargs) → Dict[str, float][source]

Estimate confidence interval of policy value by nonparametric bootstrap procedure.

Parameters
  • reward (array-like, shape (n_rounds,)) – Reward observed in each round of the logged bandit feedback, i.e., \(r_t\).

  • action (array-like, shape (n_rounds,)) – Action sampled by a behavior policy in each round of the logged bandit feedback, i.e., \(a_t\).

  • position (array-like, shape (n_rounds,)) – Positions of each round in the given logged bandit feedback.

  • action_dist (array-like, shape (n_rounds, n_actions, len_list)) – Action choice probabilities by the evaluation policy (must be deterministic), i.e., \(\pi_e(a_t|x_t)\).

  • alpha (float, default=0.05) – Significance level of the confidence interval.

  • n_bootstrap_samples (int, default=10000) – Number of resamplings performed in the bootstrap procedure.

  • random_state (int, default=None) – Controls the random seed in bootstrap sampling.

Returns

estimated_confidence_interval – Dictionary storing the estimated mean and upper-lower confidence bounds.

Return type

Dict[str, float]

estimate_policy_value(reward: numpy.ndarray, action: numpy.ndarray, position: numpy.ndarray, action_dist: numpy.ndarray, **kwargs) → float[source]

Estimate policy value of an evaluation policy.

Parameters
  • reward (array-like, shape (n_rounds,)) – Reward observed in each round of the logged bandit feedback, i.e., \(r_t\).

  • action (array-like, shape (n_rounds,)) – Action sampled by a behavior policy in each round of the logged bandit feedback, i.e., \(a_t\).

  • position (array-like, shape (n_rounds,)) – Positions of each round in the given logged bandit feedback.

  • action_dist (array-like, shape (n_rounds, n_actions, len_list)) – Action choice probabilities by the evaluation policy (must be deterministic), i.e., \(\pi_e(a_t|x_t)\).

Returns

V_hat – Estimated policy value (performance) of a given evaluation policy.

Return type

float

class obp.ope.estimators.SelfNormalizedDoublyRobust(estimator_name: str = 'sndr')[source]

Bases: obp.ope.estimators.DoublyRobust

Estimate the policy value by Self-Normalized Doubly Robust (SNDR).

Note

Self-Normalized Doubly Robust estimates the policy value of a given evaluation policy \(\pi_e\) by

\[\hat{V}_{\mathrm{SNDR}} (\pi_e; \mathcal{D}, \hat{q}) := \mathbb{E}_{\mathcal{D}} \left[\hat{q}(x_t,\pi_e) + \frac{w(x_t,a_t) (r_t - \hat{q}(x_t,a_t))}{\mathbb{E}_{\mathcal{D}}[ w(x_t,a_t) ]} \right],\]

where \(\mathcal{D}=\{(x_t,a_t,r_t)\}_{t=1}^{T}\) is logged bandit feedback data with \(T\) rounds collected by a behavior policy \(\pi_b\). \(w(x,a):=\pi_e (a|x)/\pi_b (a|x)\) is the importance weight given \(x\) and \(a\). \(\mathbb{E}_{\mathcal{D}}[\cdot]\) is the empirical average over \(T\) observations in \(\mathcal{D}\). \(\hat{q} (x,a)\) is an estimated expected reward given \(x\) and \(a\). \(\hat{q} (x_t,\pi):= \mathbb{E}_{a \sim \pi(a|x)}[\hat{q}(x,a)]\) is the expectation of the estimated reward function over \(\pi\). To estimate the mean reward function, please use obp.ope.regression_model.RegressionModel.

Similar to Self-Normalized Inverse Probability Weighting, the SNDR estimator applies the self-normalized importance weighting technique to increase the stability of the original Doubly Robust estimator.
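SNDR reuses the DR ingredients but divides the correction term by the empirical mean of the importance weights; a NumPy sketch with hypothetical inputs (len_list = 1):

import numpy as np

n_rounds, n_actions = 5, 3
rng = np.random.default_rng(0)
reward = rng.binomial(1, 0.5, size=n_rounds)
action = rng.integers(n_actions, size=n_rounds)
pscore = np.full(n_rounds, 1.0 / n_actions)
action_dist = np.full((n_rounds, n_actions, 1), 1.0 / n_actions)
q_hat = rng.uniform(size=(n_rounds, n_actions, 1))

rounds = np.arange(n_rounds)
iw = action_dist[rounds, action, 0] / pscore
q_hat_factual = q_hat[rounds, action, 0]
q_hat_pi_e = np.sum(q_hat * action_dist, axis=(1, 2))

# \hat{V}_SNDR = E_D[ \hat{q}(x_t,\pi_e) + w(x_t,a_t)(r_t - \hat{q}(x_t,a_t)) / E_D[w] ]
v_hat_sndr = np.average(q_hat_pi_e + iw * (reward - q_hat_factual) / iw.mean())
print(v_hat_sndr)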

Parameters

estimator_name (str, default=’sndr’.) – Name of off-policy estimator.

References

Miroslav Dudík, Dumitru Erhan, John Langford, and Lihong Li. “Doubly Robust Policy Evaluation and Optimization.”, 2014.

Nathan Kallus and Masatoshi Uehara. “Intrinsically Efficient, Stable, and Bounded Off-Policy Evaluation for Reinforcement Learning.”, 2019.

estimate_interval(reward: numpy.ndarray, action: numpy.ndarray, position: numpy.ndarray, pscore: numpy.ndarray, action_dist: numpy.ndarray, estimated_rewards_by_reg_model: numpy.ndarray, alpha: float = 0.05, n_bootstrap_samples: int = 10000, random_state: Optional[int] = None, **kwargs) → Dict[str, float]

Estimate confidence interval of policy value by nonparametric bootstrap procedure.

Parameters
  • reward (array-like, shape (n_rounds,)) – Reward observed in each round of the logged bandit feedback, i.e., \(r_t\).

  • action (array-like, shape (n_rounds,)) – Action sampled by a behavior policy in each round of the logged bandit feedback, i.e., \(a_t\).

  • position (array-like, shape (n_rounds,)) – Positions of each round in the given logged bandit feedback.

  • pscore (array-like, shape (n_rounds,)) – Action choice probabilities by a behavior policy (propensity scores), i.e., \(\pi_b(a_t|x_t)\).

  • action_dist (array-like, shape (n_rounds, n_actions, len_list)) – Action choice probabilities by the evaluation policy (can be deterministic), i.e., \(\pi_e(a_t|x_t)\).

  • estimated_rewards_by_reg_model (array-like, shape (n_rounds, n_actions, len_list)) – Expected rewards for each round, action, and position estimated by a regression model, i.e., \(\hat{q}(x_t,a_t)\).

  • alpha (float, default=0.05) – Significance level of the confidence interval.

  • n_bootstrap_samples (int, default=10000) – Number of resamplings performed in the bootstrap procedure.

  • random_state (int, default=None) – Controls the random seed in bootstrap sampling.

Returns

estimated_confidence_interval – Dictionary storing the estimated mean and upper-lower confidence bounds.

Return type

Dict[str, float]

estimate_policy_value(reward: numpy.ndarray, action: numpy.ndarray, position: numpy.ndarray, pscore: numpy.ndarray, action_dist: numpy.ndarray, estimated_rewards_by_reg_model: numpy.ndarray) → float

Estimate policy value of an evaluation policy.

Parameters
  • reward (array-like, shape (n_rounds,)) – Reward observed in each round of the logged bandit feedback, i.e., \(r_t\).

  • action (array-like, shape (n_rounds,)) – Action sampled by a behavior policy in each round of the logged bandit feedback, i.e., \(a_t\).

  • position (array-like, shape (n_rounds,)) – Positions of each round in the given logged bandit feedback.

  • pscore (array-like, shape (n_rounds,)) – Action choice probabilities by a behavior policy (propensity scores), i.e., \(\pi_b(a_t|x_t)\).

  • action_dist (array-like, shape (n_rounds, n_actions, len_list)) – Action choice probabilities by the evaluation policy (can be deterministic), i.e., \(\pi_e(a_t|x_t)\).

  • estimated_rewards_by_reg_model (array-like, shape (n_rounds, n_actions, len_list)) – Expected rewards for each round, action, and position estimated by a regression model, i.e., \(\hat{q}(x_t,a_t)\).

Returns

V_hat – Estimated policy value by the DR estimator.

Return type

float

class obp.ope.estimators.SelfNormalizedInverseProbabilityWeighting(estimator_name: str = 'snipw')[source]

Bases: obp.ope.estimators.InverseProbabilityWeighting

Estimate the policy value by Self-Normalized Inverse Probability Weighting (SNIPW).

Note

Self-Normalized Inverse Probability Weighting (SNIPW) estimates the policy value of a given evaluation policy \(\pi_e\) by

\[\hat{V}_{\mathrm{SNIPW}} (\pi_e; \mathcal{D}) := \frac{\mathbb{E}_{\mathcal{D}} [w(x_t,a_t) r_t]}{ \mathbb{E}_{\mathcal{D}} [w(x_t,a_t)]},\]

where \(\mathcal{D}=\{(x_t,a_t,r_t)\}_{t=1}^{T}\) is logged bandit feedback data with \(T\) rounds collected by a behavior policy \(\pi_b\). \(w(x,a):=\pi_e (a|x)/\pi_b (a|x)\) is the importance weight given \(x\) and \(a\). \(\mathbb{E}_{\mathcal{D}}[\cdot]\) is the empirical average over \(T\) observations in \(\mathcal{D}\).

SNIPW re-weights the observed rewards by the self-normalized importance weight. This estimator is not unbiased even when the behavior policy is known. However, it is still consistent for the true policy value and tends to be more stable than IPW. See the references for detailed discussions.
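SNIPW is IPW with the weights normalized by their empirical mean; a NumPy sketch with hypothetical inputs (len_list = 1):

import numpy as np

n_rounds, n_actions = 5, 3
rng = np.random.default_rng(0)
reward = rng.binomial(1, 0.5, size=n_rounds)
action = rng.integers(n_actions, size=n_rounds)
pscore = np.full(n_rounds, 1.0 / n_actions)
action_dist = np.full((n_rounds, n_actions, 1), 1.0 / n_actions)

iw = action_dist[np.arange(n_rounds), action, 0] / pscore
# \hat{V}_SNIPW = E_D[ w(x_t,a_t) r_t ] / E_D[ w(x_t,a_t) ]
v_hat_snipw = np.average(iw * reward) / np.average(iw)
print(v_hat_snipw)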

Parameters

estimator_name (str, default=’snipw’.) – Name of off-policy estimator.

References

Adith Swaminathan and Thorsten Joachims. “The Self-normalized Estimator for Counterfactual Learning.”, 2015.

Nathan Kallus and Masatoshi Uehara. “Intrinsically Efficient, Stable, and Bounded Off-Policy Evaluation for Reinforcement Learning.”, 2019.

estimate_interval(reward: numpy.ndarray, action: numpy.ndarray, position: numpy.ndarray, pscore: numpy.ndarray, action_dist: numpy.ndarray, alpha: float = 0.05, n_bootstrap_samples: int = 10000, random_state: Optional[int] = None, **kwargs) → Dict[str, float]

Estimate confidence interval of policy value by nonparametric bootstrap procedure.

Parameters
  • reward (array-like, shape (n_rounds,)) – Reward observed in each round of the logged bandit feedback, i.e., \(r_t\).

  • action (array-like, shape (n_rounds,)) – Action sampled by a behavior policy in each round of the logged bandit feedback, i.e., \(a_t\).

  • position (array-like, shape (n_rounds,)) – Positions of each round in the given logged bandit feedback.

  • pscore (array-like, shape (n_rounds,)) – Action choice probabilities by a behavior policy (propensity scores), i.e., \(\pi_b(a_t|x_t)\).

  • action_dist (array-like, shape (n_rounds, n_actions, len_list)) – Action choice probabilities by the evaluation policy (can be deterministic), i.e., \(\pi_e(a_t|x_t)\).

  • alpha (float, default=0.05) – Significance level of the confidence interval.

  • n_bootstrap_samples (int, default=10000) – Number of resamplings performed in the bootstrap procedure.

  • random_state (int, default=None) – Controls the random seed in bootstrap sampling.

Returns

estimated_confidence_interval – Dictionary storing the estimated mean and upper-lower confidence bounds.

Return type

Dict[str, float]

estimate_policy_value(reward: numpy.ndarray, action: numpy.ndarray, position: numpy.ndarray, pscore: numpy.ndarray, action_dist: numpy.ndarray, **kwargs) → numpy.ndarray

Estimate policy value of an evaluation policy.

Parameters
  • reward (array-like, shape (n_rounds,)) – Reward observed in each round of the logged bandit feedback, i.e., \(r_t\).

  • action (array-like, shape (n_rounds,)) – Action sampled by a behavior policy in each round of the logged bandit feedback, i.e., \(a_t\).

  • position (array-like, shape (n_rounds,)) – Positions of each round in the given logged bandit feedback.

  • pscore (array-like, shape (n_rounds,)) – Action choice probabilities by a behavior policy (propensity scores), i.e., \(\pi_b(a_t|x_t)\).

  • action_dist (array-like, shape (n_rounds, n_actions, len_list)) – Action choice probabilities by the evaluation policy (can be deterministic), i.e., \(\pi_e(a_t|x_t)\).

Returns

V_hat – Estimated policy value (performance) of a given evaluation policy.

Return type

float

class obp.ope.estimators.SwitchDoublyRobust(estimator_name: str = 'switch-dr', tau: float = 1)[source]

Bases: obp.ope.estimators.DoublyRobust

Estimate the policy value by Switch Doubly Robust (Switch-DR).

Note

Switch-DR aims to reduce the variance of the DR estimator by using the direct method when the importance weight is large. This estimator estimates the policy value of a given evaluation policy \(\pi_e\) by

\[\hat{V}_{\mathrm{SwitchDR}} (\pi_e; \mathcal{D}, \hat{q}, \tau) := \mathbb{E}_{\mathcal{D}} [\hat{q}(x_t,\pi_e) + w(x_t,a_t) (r_t - \hat{q}(x_t,a_t)) \mathbb{I} \{ w(x_t,a_t) \le \tau \}],\]

where \(\mathcal{D}=\{(x_t,a_t,r_t)\}_{t=1}^{T}\) is logged bandit feedback data with \(T\) rounds collected by a behavior policy \(\pi_b\). \(w(x,a):=\pi_e (a|x)/\pi_b (a|x)\) is the importance weight given \(x\) and \(a\). \(\mathbb{E}_{\mathcal{D}}[\cdot]\) is the empirical average over \(T\) observations in \(\mathcal{D}\). \(\tau (\ge 0)\) is a switching hyperparameter, which decides the threshold for the importance weight. \(\hat{q} (x,a)\) is an estimated expected reward given \(x\) and \(a\). \(\hat{q} (x_t,\pi):= \mathbb{E}_{a \sim \pi(a|x)}[\hat{q}(x,a)]\) is the expectation of the estimated reward function over \(\pi\). To estimate the mean reward function, please use obp.ope.regression_model.RegressionModel.
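The switching indicator can be added to the DR sketch in one line; the logged data below are hypothetical placeholders (len_list = 1, placeholder propensities):

import numpy as np

n_rounds, n_actions, tau = 5, 3, 2.0
rng = np.random.default_rng(0)
reward = rng.binomial(1, 0.5, size=n_rounds)
action = rng.integers(n_actions, size=n_rounds)
pscore = rng.uniform(0.1, 0.9, size=n_rounds)                     # placeholder \pi_b(a_t|x_t)
action_dist = np.full((n_rounds, n_actions, 1), 1.0 / n_actions)  # \pi_e(a|x_t)
q_hat = rng.uniform(size=(n_rounds, n_actions, 1))

rounds = np.arange(n_rounds)
iw = action_dist[rounds, action, 0] / pscore
switch = (iw <= tau).astype(float)                                # I{ w(x_t,a_t) <= tau }
q_hat_factual = q_hat[rounds, action, 0]
q_hat_pi_e = np.sum(q_hat * action_dist, axis=(1, 2))

v_hat_switch_dr = np.average(q_hat_pi_e + iw * (reward - q_hat_factual) * switch)
print(v_hat_switch_dr)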

Parameters
  • tau (float, default=1) – Switching hyperparameter. When the importance weight is larger than this parameter, the DM estimator is applied; otherwise, the DR estimator is applied. This hyperparameter should be larger than or equal to zero; otherwise it is meaningless.

  • estimator_name (str, default=’switch-dr’.) – Name of off-policy estimator.

References

Miroslav Dudík, Dumitru Erhan, John Langford, and Lihong Li. “Doubly Robust Policy Evaluation and Optimization.”, 2014.

Yu-Xiang Wang, Alekh Agarwal, and Miroslav Dudík. “Optimal and Adaptive Off-policy Evaluation in Contextual Bandits.”, 2016.

estimate_interval(reward: numpy.ndarray, action: numpy.ndarray, position: numpy.ndarray, pscore: numpy.ndarray, action_dist: numpy.ndarray, estimated_rewards_by_reg_model: numpy.ndarray, alpha: float = 0.05, n_bootstrap_samples: int = 10000, random_state: Optional[int] = None, **kwargs) → Dict[str, float]

Estimate confidence interval of policy value by nonparametric bootstrap procedure.

Parameters
  • reward (array-like, shape (n_rounds,)) – Reward observed in each round of the logged bandit feedback, i.e., \(r_t\).

  • action (array-like, shape (n_rounds,)) – Action sampled by a behavior policy in each round of the logged bandit feedback, i.e., \(a_t\).

  • position (array-like, shape (n_rounds,)) – Positions of each round in the given logged bandit feedback.

  • pscore (array-like, shape (n_rounds,)) – Action choice probabilities by a behavior policy (propensity scores), i.e., \(\pi_b(a_t|x_t)\).

  • action_dist (array-like, shape (n_rounds, n_actions, len_list)) – Action choice probabilities by the evaluation policy (can be deterministic), i.e., \(\pi_e(a_t|x_t)\).

  • estimated_rewards_by_reg_model (array-like, shape (n_rounds, n_actions, len_list)) – Expected rewards for each round, action, and position estimated by a regression model, i.e., \(\hat{q}(x_t,a_t)\).

  • alpha (float, default=0.05) – Significance level of the confidence interval.

  • n_bootstrap_samples (int, default=10000) – Number of resamplings performed in the bootstrap procedure.

  • random_state (int, default=None) – Controls the random seed in bootstrap sampling.

Returns

estimated_confidence_interval – Dictionary storing the estimated mean and upper-lower confidence bounds.

Return type

Dict[str, float]

estimate_policy_value(reward: numpy.ndarray, action: numpy.ndarray, position: numpy.ndarray, pscore: numpy.ndarray, action_dist: numpy.ndarray, estimated_rewards_by_reg_model: numpy.ndarray) → float

Estimate policy value of an evaluation policy.

Parameters
  • reward (array-like, shape (n_rounds,)) – Reward observed in each round of the logged bandit feedback, i.e., \(r_t\).

  • action (array-like, shape (n_rounds,)) – Action sampled by a behavior policy in each round of the logged bandit feedback, i.e., \(a_t\).

  • position (array-like, shape (n_rounds,)) – Positions of each round in the given logged bandit feedback.

  • pscore (array-like, shape (n_rounds,)) – Action choice probabilities by a behavior policy (propensity scores), i.e., \(\pi_b(a_t|x_t)\).

  • action_dist (array-like, shape (n_rounds, n_actions, len_list)) – Action choice probabilities by the evaluation policy (can be deterministic), i.e., \(\pi_e(a_t|x_t)\).

  • estimated_rewards_by_reg_model (array-like, shape (n_rounds, n_actions, len_list)) – Expected rewards for each round, action, and position estimated by a regression model, i.e., \(\hat{q}(x_t,a_t)\).

Returns

V_hat – Estimated policy value by the DR estimator.

Return type

float

class obp.ope.estimators.SwitchInverseProbabilityWeighting(estimator_name: str = 'switch-ipw', tau: float = 1)[source]

Bases: obp.ope.estimators.DoublyRobust

Estimate the policy value by Switch Inverse Probability Weighting (Switch-IPW).

Note

Switch-IPW aims to reduce the variance of the IPW estimator by using the direct method when the importance weight is large. This estimator estimates the policy value of a given evaluation policy \(\pi_e\) by

\[\begin{split}& \hat{V}_{\mathrm{SwitchIPW}} (\pi_e; \mathcal{D}, \tau) \\ & := \mathbb{E}_{\mathcal{D}} \left[ \sum_{a \in \mathcal{A}} \hat{q} (x_t, a) \pi_e (a|x_t) \mathbb{I} \{ w(x_t, a) > \tau \} + w(x_t,a_t) r_t \mathbb{I} \{ w(x_t,a_t) \le \tau \} \right],\end{split}\]

where \(\mathcal{D}=\{(x_t,a_t,r_t)\}_{t=1}^{T}\) is logged bandit feedback data with \(T\) rounds collected by a behavior policy \(\pi_b\). \(w(x,a):=\pi_e (a|x)/\pi_b (a|x)\) is the importance weight given \(x\) and \(a\). \(\mathbb{E}_{\mathcal{D}}[\cdot]\) is the empirical average over \(T\) observations in \(\mathcal{D}\). \(\tau (\ge 0)\) is a switching hyperparameter, which decides the threshold for the importance weight. To estimate the mean reward function, please use obp.ope.regression_model.RegressionModel.
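A NumPy sketch of Switch-IPW with hypothetical inputs (len_list = 1, placeholder behavior-policy probabilities): the model-based term is used for actions whose weight exceeds tau, and the IPW term is kept only when the logged action's weight is at most tau.

import numpy as np

n_rounds, n_actions, tau = 5, 3, 2.0
rng = np.random.default_rng(0)
reward = rng.binomial(1, 0.5, size=n_rounds)
action = rng.integers(n_actions, size=n_rounds)
pscore_full = rng.dirichlet(np.ones(n_actions), size=n_rounds)    # placeholder \pi_b(a|x_t)
action_dist = np.full((n_rounds, n_actions, 1), 1.0 / n_actions)  # \pi_e(a|x_t)
q_hat = rng.uniform(size=(n_rounds, n_actions, 1))

iw_all = action_dist[:, :, 0] / pscore_full                       # w(x_t, a) for every action
iw = iw_all[np.arange(n_rounds), action]                          # w(x_t, a_t)

dm_part = np.sum(q_hat[:, :, 0] * action_dist[:, :, 0] * (iw_all > tau), axis=1)
ipw_part = iw * reward * (iw <= tau)
v_hat_switch_ipw = np.average(dm_part + ipw_part)
print(v_hat_switch_ipw)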

Parameters
  • tau (float, default=1) – Switching hyperparameter. When the importance weight is larger than this parameter, the DM estimator is applied; otherwise, the IPW estimator is applied. This hyperparameter should be larger than 1; otherwise it is meaningless.

  • estimator_name (str, default=’switch-ipw’.) – Name of off-policy estimator.

References

Miroslav Dudík, Dumitru Erhan, John Langford, and Lihong Li. “Doubly Robust Policy Evaluation and Optimization.”, 2014.

Yu-Xiang Wang, Alekh Agarwal, and Miroslav Dudík. “Optimal and Adaptive Off-policy Evaluation in Contextual Bandits.”, 2016.

estimate_interval(reward: numpy.ndarray, action: numpy.ndarray, position: numpy.ndarray, pscore: numpy.ndarray, action_dist: numpy.ndarray, estimated_rewards_by_reg_model: numpy.ndarray, alpha: float = 0.05, n_bootstrap_samples: int = 10000, random_state: Optional[int] = None, **kwargs) → Dict[str, float]

Estimate confidence interval of policy value by nonparametric bootstrap procedure.

Parameters
  • reward (array-like, shape (n_rounds,)) – Reward observed in each round of the logged bandit feedback, i.e., \(r_t\).

  • action (array-like, shape (n_rounds,)) – Action sampled by a behavior policy in each round of the logged bandit feedback, i.e., \(a_t\).

  • position (array-like, shape (n_rounds,)) – Positions of each round in the given logged bandit feedback.

  • pscore (array-like, shape (n_rounds,)) – Action choice probabilities by a behavior policy (propensity scores), i.e., \(\pi_b(a_t|x_t)\).

  • action_dist (array-like, shape (n_rounds, n_actions, len_list)) – Action choice probabilities by the evaluation policy (can be deterministic), i.e., \(\pi_e(a_t|x_t)\).

  • estimated_rewards_by_reg_model (array-like, shape (n_rounds, n_actions, len_list)) – Expected rewards for each round, action, and position estimated by a regression model, i.e., \(\hat{q}(x_t,a_t)\).

  • alpha (float, default=0.05) – Significance level of the confidence interval.

  • n_bootstrap_samples (int, default=10000) – Number of resamplings performed in the bootstrap procedure.

  • random_state (int, default=None) – Controls the random seed in bootstrap sampling.

Returns

estimated_confidence_interval – Dictionary storing the estimated mean and upper-lower confidence bounds.

Return type

Dict[str, float]

estimate_policy_value(reward: numpy.ndarray, action: numpy.ndarray, position: numpy.ndarray, pscore: numpy.ndarray, action_dist: numpy.ndarray, estimated_rewards_by_reg_model: numpy.ndarray) → float

Estimate policy value of an evaluation policy.

Parameters
  • reward (array-like, shape (n_rounds,)) – Reward observed in each round of the logged bandit feedback, i.e., \(r_t\).

  • action (array-like, shape (n_rounds,)) – Action sampled by a behavior policy in each round of the logged bandit feedback, i.e., \(a_t\).

  • position (array-like, shape (n_rounds,)) – Positions of each round in the given logged bandit feedback.

  • pscore (array-like, shape (n_rounds,)) – Action choice probabilities by a behavior policy (propensity scores), i.e., \(\pi_b(a_t|x_t)\).

  • action_dist (array-like, shape (n_rounds, n_actions, len_list)) – Action choice probabilities by the evaluation policy (can be deterministic), i.e., \(\pi_e(a_t|x_t)\).

  • estimated_rewards_by_reg_model (array-like, shape (n_rounds, n_actions, len_list)) – Expected rewards for each round, action, and position estimated by a regression model, i.e., \(\hat{q}(x_t,a_t)\).

Returns

V_hat – Estimated policy value by the DR estimator.

Return type

float