obp.ope.meta
Off-Policy Evaluation Class to Streamline OPE.
Classes

OffPolicyEvaluation – Class to conduct off-policy evaluation by multiple off-policy estimators simultaneously.
class obp.ope.meta.OffPolicyEvaluation(bandit_feedback: Dict[str, Union[int, numpy.ndarray]], ope_estimators: List[obp.ope.estimators.BaseOffPolicyEstimator])

Bases: object
Class to conduct off-policy evaluation by multiple off-policy estimators simultaneously.
- Parameters
bandit_feedback (BanditFeedback) – Logged bandit feedback data used for off-policy evaluation.
ope_estimators (List[BaseOffPolicyEstimator]) – List of OPE estimators used to evaluate the policy value of the evaluation policy. Estimators must follow the interface of obp.ope.BaseOffPolicyEstimator.
Examples
# a case for implementing OPE of the BernoulliTS policy
# using log data generated by the Random policy
>>> from obp.dataset import OpenBanditDataset
>>> from obp.policy import BernoulliTS
>>> from obp.ope import OffPolicyEvaluation, InverseProbabilityWeighting as IPW

# (1) Data loading and preprocessing
>>> dataset = OpenBanditDataset(behavior_policy='random', campaign='all')
>>> bandit_feedback = dataset.obtain_batch_bandit_feedback()
>>> bandit_feedback.keys()
dict_keys(['n_rounds', 'n_actions', 'action', 'position', 'reward', 'pscore', 'context', 'action_context'])

# (2) Off-Policy Learning
>>> evaluation_policy = BernoulliTS(
        n_actions=dataset.n_actions,
        len_list=dataset.len_list,
        is_zozotown_prior=True,  # replicate the policy in the ZOZOTOWN production
        campaign="all",
        random_state=12345
    )
>>> action_dist = evaluation_policy.compute_batch_action_dist(
        n_sim=100000, n_rounds=bandit_feedback["n_rounds"]
    )

# (3) Off-Policy Evaluation
>>> ope = OffPolicyEvaluation(bandit_feedback=bandit_feedback, ope_estimators=[IPW()])
>>> estimated_policy_value = ope.estimate_policy_values(action_dist=action_dist)
>>> estimated_policy_value
{'ipw': 0.004553...}

# policy value improvement of BernoulliTS over the Random policy estimated by IPW
>>> estimated_policy_value_improvement = estimated_policy_value['ipw'] / bandit_feedback['reward'].mean()
# our OPE procedure suggests that BernoulliTS improves Random by 19.81%
>>> print(estimated_policy_value_improvement)
1.198126...
estimate_intervals(action_dist: numpy.ndarray, estimated_rewards_by_reg_model: Optional[numpy.ndarray] = None, alpha: float = 0.05, n_bootstrap_samples: int = 100, random_state: Optional[int] = None) → Dict[str, Dict[str, float]]

Estimate confidence intervals of estimated policy values using a nonparametric bootstrap procedure.
- Parameters
action_dist (array-like, shape (n_rounds, n_actions, len_list)) – Action choice probabilities by the evaluation policy (can be deterministic), i.e., \(\pi_e(a_t|x_t)\).
estimated_rewards_by_reg_model (array-like, shape (n_rounds, n_actions, len_list), default=None) – Expected rewards for each round, action, and position estimated by a regression model, i.e., \(\hat{q}(x_t,a_t)\). When it is not given, model-dependent estimators such as DM and DR cannot be used.
alpha (float, default=0.05) – Significance level of the bootstrap confidence intervals.
n_bootstrap_samples (int, default=100) – Number of resamplings performed in the bootstrap procedure.
random_state (int, default=None) – Controls the random seed in bootstrap sampling.
- Returns
policy_value_interval_dict – Dictionary containing the confidence intervals of the policy values estimated by each OPE estimator, computed with a nonparametric bootstrap procedure.
- Return type
Dict[str, Dict[str, float]]
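A minimal usage sketch, continuing from the Examples above (where ope and action_dist are created); the keyword values below are illustrative rather than library defaults:

>>> policy_value_interval_dict = ope.estimate_intervals(
        action_dist=action_dist,
        alpha=0.05,
        n_bootstrap_samples=1000,
        random_state=12345,
    )
# nested dictionary: estimator name -> bootstrap statistics of its policy value estimate
>>> policy_value_interval_dict.keys()
dict_keys(['ipw'])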
estimate_policy_values(action_dist: numpy.ndarray, estimated_rewards_by_reg_model: Optional[numpy.ndarray] = None) → Dict[str, float]

Estimate the policy value of an evaluation policy.
- Parameters
action_dist (array-like, shape (n_rounds, n_actions, len_list)) – Action choice probabilities by the evaluation policy (can be deterministic), i.e., \(\pi_e(a_t|x_t)\).
estimated_rewards_by_reg_model (array-like, shape (n_rounds, n_actions, len_list), default=None) – Expected rewards for each round, action, and position estimated by a regression model, i.e., \(\hat{q}(x_t,a_t)\). When None is given, model-dependent estimators such as DM and DR cannot be used.
- Returns
policy_value_dict – Dictionary containing estimated policy values by OPE estimators.
- Return type
Dict[str, float]
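The Examples above use IPW alone. The following sketch shows how model-dependent estimators might be included, assuming the DirectMethod and DoublyRobust estimators from obp.ope and a separately obtained reward-model prediction array (the random array here is only a shape placeholder, not the output of a real regression model):

>>> import numpy as np
>>> from obp.ope import DirectMethod as DM, DoublyRobust as DR
# placeholder predictions; in practice these come from a fitted regression model
>>> estimated_rewards_by_reg_model = np.random.uniform(
        size=(bandit_feedback["n_rounds"], dataset.n_actions, dataset.len_list)
    )
>>> ope_with_dr = OffPolicyEvaluation(
        bandit_feedback=bandit_feedback, ope_estimators=[IPW(), DM(), DR()]
    )
# returns a dict keyed by estimator name ('ipw', 'dm', 'dr')
>>> ope_with_dr.estimate_policy_values(
        action_dist=action_dist,
        estimated_rewards_by_reg_model=estimated_rewards_by_reg_model,
    )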
evaluate_performance_of_estimators(ground_truth_policy_value: float, action_dist: numpy.ndarray, estimated_rewards_by_reg_model: Optional[numpy.ndarray] = None, metric: str = 'relative-ee') → Dict[str, float]

Evaluate estimation performances of OPE estimators.
Note
Evaluate the estimation performances of OPE estimators by relative estimation error (relative-EE) or squared error (SE):
\[\text{Relative-EE} (\hat{V}; \mathcal{D}) = \left| \frac{\hat{V}(\pi; \mathcal{D}) - V(\pi)}{V(\pi)} \right|,\]

\[\text{SE} (\hat{V}; \mathcal{D}) = \left(\hat{V}(\pi; \mathcal{D}) - V(\pi) \right)^2,\]

where \(V(\pi)\) is the ground-truth policy value of the evaluation policy \(\pi_e\) (often estimated using on-policy estimation), and \(\hat{V}(\pi; \mathcal{D})\) is the policy value estimated by an OPE estimator \(\hat{V}\) from the logged bandit feedback \(\mathcal{D}\).
- Parameters
ground_truth_policy_value (float) – Ground-truth policy value of the evaluation policy, i.e., \(V(\pi)\). With Open Bandit Dataset, we generally use an on-policy estimate of the policy value as its ground-truth.
action_dist (array-like, shape (n_rounds, n_actions, len_list)) – Action choice probabilities by the evaluation policy (can be deterministic), i.e., \(\pi_e(a_t|x_t)\).
estimated_rewards_by_reg_model (array-like, shape (n_rounds, n_actions, len_list), default=None) – Expected rewards for each round, action, and position estimated by a regression model, i.e., \(\hat{q}(x_t,a_t)\). When it is not given, model-dependent estimators such as DM and DR cannot be used.
metric (str, default=”relative-ee”) – Evaluation metric to evaluate and compare the estimation performance of OPE estimators. Must be “relative-ee” or “se”.
- Returns
eval_metric_ope_dict – Dictionary containing the evaluation metric value of each OPE estimator.
- Return type
Dict[str, float]
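A usage sketch, assuming an on-policy estimate of the evaluation policy's value is available as the ground truth; for Open Bandit Dataset, the helper OpenBanditDataset.calc_on_policy_policy_value_estimate can provide such an estimate from the BernoulliTS logs (treat this helper call as an assumption of the sketch):

>>> ground_truth_policy_value = OpenBanditDataset.calc_on_policy_policy_value_estimate(
        behavior_policy='bts', campaign='all'
    )
>>> relative_ee = ope.evaluate_performance_of_estimators(
        ground_truth_policy_value=ground_truth_policy_value,
        action_dist=action_dist,
        metric="relative-ee",
    )
# dictionary: estimator name -> relative estimation error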
summarize_estimators_comparison(ground_truth_policy_value: float, action_dist: numpy.ndarray, estimated_rewards_by_reg_model: Optional[numpy.ndarray] = None, metric: str = 'relative-ee') → pandas.core.frame.DataFrame

Summarize performance comparisons of OPE estimators.
- Parameters
ground_truth_policy_value (float) – Ground-truth policy value of the evaluation policy, i.e., \(V(\pi)\). With Open Bandit Dataset, we generally use an on-policy estimate of the policy value as ground-truth.
action_dist (array-like, shape (n_rounds, n_actions, len_list)) – Action choice probabilities by the evaluation policy (can be deterministic), i.e., \(\pi_e(a_t|x_t)\).
estimated_rewards_by_reg_model (array-like, shape (n_rounds, n_actions, len_list), default=None) – Expected rewards for each round, action, and position estimated by a regression model, i.e., \(\hat{q}(x_t,a_t)\). When it is not given, model-dependent estimators such as DM and DR cannot be used.
metric (str, default=”relative-ee”) – Evaluation metric to evaluate and compare the estimation performance of OPE estimators. Must be either “relative-ee” or “se”.
- Returns
eval_metric_ope_df – DataFrame summarizing the evaluation metric value of each OPE estimator.
- Return type
DataFrame
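A sketch of the tabular counterpart, reusing ground_truth_policy_value from the previous sketch; the returned DataFrame has one row per estimator:

>>> eval_metric_ope_df = ope.summarize_estimators_comparison(
        ground_truth_policy_value=ground_truth_policy_value,
        action_dist=action_dist,
        metric="se",
    )
>>> eval_metric_ope_df  # squared error of each estimator, indexed by estimator name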
summarize_off_policy_estimates(action_dist: numpy.ndarray, estimated_rewards_by_reg_model: Optional[numpy.ndarray] = None, alpha: float = 0.05, n_bootstrap_samples: int = 100, random_state: Optional[int] = None) → Tuple[pandas.core.frame.DataFrame, pandas.core.frame.DataFrame]

Summarize policy values estimated by OPE estimators and their confidence intervals.
- Parameters
action_dist (array-like, shape (n_rounds, n_actions, len_list)) – Action choice probabilities by the evaluation policy (can be deterministic), i.e., \(\pi_e(a_t|x_t)\).
estimated_rewards_by_reg_model (array-like, shape (n_rounds, n_actions, len_list), default=None) – Expected rewards for each round, action, and position estimated by a regression model, i.e., \(\hat{q}(x_t,a_t)\). When it is not given, model-dependent estimators such as DM and DR cannot be used.
alpha (float, default=0.05) – Significance level of the bootstrap confidence intervals.
n_bootstrap_samples (int, default=100) – Number of resamplings performed in the bootstrap procedure.
random_state (int, default=None) – Controls the random seed in bootstrap sampling.
- Returns
(policy_value_df, policy_value_interval_df) – Estimated policy values and their confidence intervals by OPE estimators.
- Return type
Tuple[DataFrame, DataFrame]
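A minimal sketch of the summary interface, again continuing from the Examples above; keyword values are illustrative:

>>> policy_value_df, policy_value_interval_df = ope.summarize_off_policy_estimates(
        action_dist=action_dist,
        n_bootstrap_samples=1000,
        random_state=12345,
    )
>>> policy_value_df           # one row per estimator with its point estimate
>>> policy_value_interval_df  # one row per estimator with its bootstrap confidence interval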
visualize_off_policy_estimates(action_dist: numpy.ndarray, estimated_rewards_by_reg_model: Optional[numpy.ndarray] = None, alpha: float = 0.05, is_relative: bool = False, n_bootstrap_samples: int = 100, random_state: Optional[int] = None, fig_dir: Optional[pathlib.Path] = None, fig_name: str = 'estimated_policy_value.png') → None

Visualize policy values estimated by OPE estimators.
- Parameters
action_dist (array-like, shape (n_rounds, n_actions, len_list)) – Action choice probabilities by the evaluation policy (can be deterministic), i.e., \(\pi_e(a_t|x_t)\).
estimated_rewards_by_reg_model (array-like, shape (n_rounds, n_actions, len_list), default=None) – Expected rewards for each round, action, and position estimated by a regression model, i.e., \(\hat{q}(x_t,a_t)\). When it is not given, model-dependent estimators such as DM and DR cannot be used.
alpha (float, default=0.05) – Significance level of the bootstrap confidence intervals.
n_bootstrap_samples (int, default=100) – Number of resamplings performed in the bootstrap procedure.
random_state (int, default=None) – Controls the random seed in bootstrap sampling.
is_relative (bool, default=False) – If True, the method visualizes the estimated policy values of the evaluation policy relative to the ground-truth policy value of the behavior policy.
fig_dir (Path, default=None) – Path to store the bar figure. If None is given, the figure will not be saved.
fig_name (str, default=”estimated_policy_value.png”) – Name of the bar figure.
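A usage sketch; the output directory below is hypothetical, and the remaining keyword values are illustrative:

>>> from pathlib import Path
>>> ope.visualize_off_policy_estimates(
        action_dist=action_dist,
        is_relative=True,  # plot estimates relative to the behavior policy's observed value
        n_bootstrap_samples=1000,
        random_state=12345,
        fig_dir=Path("./figures"),  # hypothetical directory; None skips saving
        fig_name="estimated_policy_value.png",
    )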