obp.ope.meta
Off-Policy Evaluation Class to Streamline OPE.
Classes

OffPolicyEvaluation – Class to conduct off-policy evaluation by multiple off-policy estimators simultaneously.
class obp.ope.meta.OffPolicyEvaluation(bandit_feedback: Dict[str, Union[int, numpy.ndarray]], ope_estimators: List[obp.ope.estimators.BaseOffPolicyEstimator])

Bases: object
Class to conduct off-policy evaluation by multiple off-policy estimators simultaneously.
- Parameters
bandit_feedback (BanditFeedback) – Logged bandit feedback data used for off-policy evaluation.
ope_estimators (List[BaseOffPolicyEstimator]) – List of OPE estimators used to evaluate the policy value of the evaluation policy. Estimators must follow the interface of obp.ope.BaseOffPolicyEstimator.
Examples
# a case for implementing OPE of the BernoulliTS policy
# using log data generated by the Random policy
>>> from obp.dataset import OpenBanditDataset
>>> from obp.policy import BernoulliTS
>>> from obp.ope import OffPolicyEvaluation, InverseProbabilityWeighting as IPW

# (1) Data loading and preprocessing
>>> dataset = OpenBanditDataset(behavior_policy='random', campaign='all')
>>> bandit_feedback = dataset.obtain_batch_bandit_feedback()
>>> bandit_feedback.keys()
dict_keys(['n_rounds', 'n_actions', 'action', 'position', 'reward', 'pscore', 'context', 'action_context'])

# (2) Off-Policy Learning
>>> evaluation_policy = BernoulliTS(
        n_actions=dataset.n_actions,
        len_list=dataset.len_list,
        is_zozotown_prior=True,  # replicate the policy in the ZOZOTOWN production
        campaign="all",
        random_state=12345
    )
>>> action_dist = evaluation_policy.compute_batch_action_dist(
        n_sim=100000, n_rounds=bandit_feedback["n_rounds"]
    )

# (3) Off-Policy Evaluation
>>> ope = OffPolicyEvaluation(bandit_feedback=bandit_feedback, ope_estimators=[IPW()])
>>> estimated_policy_value = ope.estimate_policy_values(action_dist=action_dist)
>>> estimated_policy_value
{'ipw': 0.004553...}

# policy value improvement of BernoulliTS over the Random policy estimated by IPW
>>> estimated_policy_value_improvement = estimated_policy_value['ipw'] / bandit_feedback['reward'].mean()
# our OPE procedure suggests that BernoulliTS improves Random by 19.81%
>>> print(estimated_policy_value_improvement)
1.198126...
estimate_intervals(action_dist: numpy.ndarray, estimated_rewards_by_reg_model: Optional[numpy.ndarray] = None, alpha: float = 0.05, n_bootstrap_samples: int = 100, random_state: Optional[int] = None) → Dict[str, Dict[str, float]]

Estimate confidence intervals of estimated policy values using a nonparametric bootstrap procedure.
- Parameters
action_dist (array-like, shape (n_rounds, n_actions, len_list)) – Action choice probabilities by the evaluation policy (can be deterministic), i.e., \(\pi_e(a_t|x_t)\).
estimated_rewards_by_reg_model (array-like, shape (n_rounds, n_actions, len_list), default=None) – Expected rewards for each round, action, and position estimated by a regression model, i.e., \(\hat{q}(x_t,a_t)\). When it is not given, model-dependent estimators such as DM and DR cannot be used.
alpha (float, default=0.05) – Significance level of the bootstrap confidence intervals.
n_bootstrap_samples (int, default=100) – Number of resamplings performed in the bootstrap procedure.
random_state (int, default=None) – Controls the random seed in bootstrap sampling.
- Returns
policy_value_interval_dict – Dictionary containing the confidence intervals of the policy values estimated by each OPE estimator, computed with a nonparametric bootstrap procedure.
- Return type
Dict[str, Dict[str, float]]
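A minimal usage sketch, continuing from the Examples above (where ope and action_dist are created); the keyword values below are illustrative rather than library defaults:

>>> policy_value_interval_dict = ope.estimate_intervals(
        action_dist=action_dist,
        alpha=0.05,
        n_bootstrap_samples=1000,
        random_state=12345,
    )
# nested dictionary: estimator name -> bootstrap statistics of its policy value estimate
>>> policy_value_interval_dict.keys()
dict_keys(['ipw'])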
estimate_policy_values(action_dist: numpy.ndarray, estimated_rewards_by_reg_model: Optional[numpy.ndarray] = None) → Dict[str, float]

Estimate the policy value of an evaluation policy.
- Parameters
action_dist (array-like, shape (n_rounds, n_actions, len_list)) – Action choice probabilities by the evaluation policy (can be deterministic), i.e., \(\pi_e(a_t|x_t)\).
estimated_rewards_by_reg_model (array-like, shape (n_rounds, n_actions, len_list), default=None) – Expected rewards for each round, action, and position estimated by a regression model, i.e., \(\hat{q}(x_t,a_t)\). When None is given, model-dependent estimators such as DM and DR cannot be used.
- Returns
policy_value_dict – Dictionary containing estimated policy values by OPE estimators.
- Return type
Dict[str, float]
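The Examples above use IPW alone. The following sketch shows how model-dependent estimators might be included, assuming the DirectMethod and DoublyRobust estimators from obp.ope and a separately obtained reward-model prediction array (the random array here is only a shape placeholder, not the output of a real regression model):

>>> import numpy as np
>>> from obp.ope import DirectMethod as DM, DoublyRobust as DR
# placeholder predictions; in practice these come from a fitted regression model
>>> estimated_rewards_by_reg_model = np.random.uniform(
        size=(bandit_feedback["n_rounds"], dataset.n_actions, dataset.len_list)
    )
>>> ope_with_dr = OffPolicyEvaluation(
        bandit_feedback=bandit_feedback, ope_estimators=[IPW(), DM(), DR()]
    )
# returns a dict keyed by estimator name ('ipw', 'dm', 'dr')
>>> ope_with_dr.estimate_policy_values(
        action_dist=action_dist,
        estimated_rewards_by_reg_model=estimated_rewards_by_reg_model,
    )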
evaluate_performance_of_estimators(ground_truth_policy_value: float, action_dist: numpy.ndarray, estimated_rewards_by_reg_model: Optional[numpy.ndarray] = None, metric: str = 'relative-ee') → Dict[str, float]

Evaluate estimation performances of OPE estimators.
Note
Evaluate the estimation performances of OPE estimators by relative estimation error (relative-EE) or squared error (SE):
\[\text{Relative-EE} (\hat{V}; \mathcal{D}) = \left| \frac{\hat{V}(\pi; \mathcal{D}) - V(\pi)}{V(\pi)} \right|,\]

\[\text{SE} (\hat{V}; \mathcal{D}) = \left(\hat{V}(\pi; \mathcal{D}) - V(\pi) \right)^2,\]

where \(V(\pi)\) is the ground-truth policy value of the evaluation policy \(\pi_e\) (often estimated using on-policy estimation), and \(\hat{V}(\pi; \mathcal{D})\) is the policy value estimated by an OPE estimator \(\hat{V}\) from the logged bandit feedback \(\mathcal{D}\).
- Parameters
ground_truth_policy_value (float) – Ground-truth policy value of the evaluation policy, i.e., \(V(\pi)\). With Open Bandit Dataset, we generally use an on-policy estimate of the policy value as its ground-truth.
action_dist (array-like, shape (n_rounds, n_actions, len_list)) – Action choice probabilities by the evaluation policy (can be deterministic), i.e., \(\pi_e(a_t|x_t)\).
estimated_rewards_by_reg_model (array-like, shape (n_rounds, n_actions, len_list), default=None) – Expected rewards for each round, action, and position estimated by a regression model, i.e., \(\hat{q}(x_t,a_t)\). When it is not given, model-dependent estimators such as DM and DR cannot be used.
metric (str, default=”relative-ee”) – Evaluation metric to evaluate and compare the estimation performance of OPE estimators. Must be “relative-ee” or “se”.
- Returns
eval_metric_ope_dict – Dictionary containing the evaluation metric value of each OPE estimator.
- Return type
Dict[str, float]
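A usage sketch, assuming an on-policy estimate of the evaluation policy's value is available as the ground truth; for Open Bandit Dataset, the helper OpenBanditDataset.calc_on_policy_policy_value_estimate can provide such an estimate from the BernoulliTS logs (treat this helper call as an assumption of the sketch):

>>> ground_truth_policy_value = OpenBanditDataset.calc_on_policy_policy_value_estimate(
        behavior_policy='bts', campaign='all'
    )
>>> relative_ee = ope.evaluate_performance_of_estimators(
        ground_truth_policy_value=ground_truth_policy_value,
        action_dist=action_dist,
        metric="relative-ee",
    )
# dictionary: estimator name -> relative estimation error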
summarize_estimators_comparison(ground_truth_policy_value: float, action_dist: numpy.ndarray, estimated_rewards_by_reg_model: Optional[numpy.ndarray] = None, metric: str = 'relative-ee') → pandas.core.frame.DataFrame

Summarize performance comparisons of OPE estimators.
- Parameters
ground_truth_policy_value (float) – Ground-truth policy value of the evaluation policy, i.e., \(V(\pi)\). With Open Bandit Dataset, we generally use an on-policy estimate of the policy value as ground-truth.
action_dist (array-like, shape (n_rounds, n_actions, len_list)) – Action choice probabilities by the evaluation policy (can be deterministic), i.e., \(\pi_e(a_t|x_t)\).
estimated_rewards_by_reg_model (array-like, shape (n_rounds, n_actions, len_list), default=None) – Expected rewards for each round, action, and position estimated by a regression model, i.e., \(\hat{q}(x_t,a_t)\). When it is not given, model-dependent estimators such as DM and DR cannot be used.
metric (str, default=”relative-ee”) – Evaluation metric to evaluate and compare the estimation performance of OPE estimators. Must be either “relative-ee” or “se”.
- Returns
eval_metric_ope_df – DataFrame summarizing the evaluation metric value of each OPE estimator.
- Return type
DataFrame
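A sketch of the tabular counterpart, reusing ground_truth_policy_value from the previous sketch; the returned DataFrame has one row per estimator:

>>> eval_metric_ope_df = ope.summarize_estimators_comparison(
        ground_truth_policy_value=ground_truth_policy_value,
        action_dist=action_dist,
        metric="se",
    )
>>> eval_metric_ope_df  # squared error of each estimator, indexed by estimator name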
summarize_off_policy_estimates(action_dist: numpy.ndarray, estimated_rewards_by_reg_model: Optional[numpy.ndarray] = None, alpha: float = 0.05, n_bootstrap_samples: int = 100, random_state: Optional[int] = None) → Tuple[pandas.core.frame.DataFrame, pandas.core.frame.DataFrame]

Summarize policy values estimated by OPE estimators and their confidence intervals.
- Parameters
action_dist (array-like, shape (n_rounds, n_actions, len_list)) – Action choice probabilities by the evaluation policy (can be deterministic), i.e., \(\pi_e(a_t|x_t)\).
estimated_rewards_by_reg_model (array-like, shape (n_rounds, n_actions, len_list), default=None) – Expected rewards for each round, action, and position estimated by a regression model, i.e., \(\hat{q}(x_t,a_t)\). When it is not given, model-dependent estimators such as DM and DR cannot be used.
alpha (float, default=0.05) – Significance level of the bootstrap confidence intervals.
n_bootstrap_samples (int, default=100) – Number of resamplings performed in the bootstrap procedure.
random_state (int, default=None) – Controls the random seed in bootstrap sampling.
- Returns
(policy_value_df, policy_value_interval_df) – Estimated policy values and their confidence intervals by OPE estimators.
- Return type
Tuple[DataFrame, DataFrame]
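A minimal sketch of the summary interface, again continuing from the Examples above; keyword values are illustrative:

>>> policy_value_df, policy_value_interval_df = ope.summarize_off_policy_estimates(
        action_dist=action_dist,
        n_bootstrap_samples=1000,
        random_state=12345,
    )
>>> policy_value_df           # one row per estimator with its point estimate
>>> policy_value_interval_df  # one row per estimator with its bootstrap confidence interval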
visualize_off_policy_estimates(action_dist: numpy.ndarray, estimated_rewards_by_reg_model: Optional[numpy.ndarray] = None, alpha: float = 0.05, is_relative: bool = False, n_bootstrap_samples: int = 100, random_state: Optional[int] = None, fig_dir: Optional[pathlib.Path] = None, fig_name: str = 'estimated_policy_value.png') → None

Visualize policy values estimated by OPE estimators.
- Parameters
action_dist (array-like, shape (n_rounds, n_actions, len_list)) – Action choice probabilities by the evaluation policy (can be deterministic), i.e., \(\pi_e(a_t|x_t)\).
estimated_rewards_by_reg_model (array-like, shape (n_rounds, n_actions, len_list), default=None) – Expected rewards for each round, action, and position estimated by a regression model, i.e., \(\hat{q}(x_t,a_t)\). When it is not given, model-dependent estimators such as DM and DR cannot be used.
alpha (float, default=0.05) – Significance level of the bootstrap confidence intervals.
n_bootstrap_samples (int, default=100) – Number of resamplings performed in the bootstrap procedure.
random_state (int, default=None) – Controls the random seed in bootstrap sampling.
is_relative (bool, default=False) – If True, the method visualizes the estimated policy values of the evaluation policy relative to the ground-truth policy value of the behavior policy.
fig_dir (Path, default=None) – Path to store the bar figure. If None is given, the figure will not be saved.
fig_name (str, default=”estimated_policy_value.png”) – Name of the bar figure.
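A usage sketch; the output directory below is hypothetical, and the remaining keyword values are illustrative:

>>> from pathlib import Path
>>> ope.visualize_off_policy_estimates(
        action_dist=action_dist,
        is_relative=True,  # plot estimates relative to the behavior policy's observed value
        n_bootstrap_samples=1000,
        random_state=12345,
        fig_dir=Path("./figures"),  # hypothetical directory; None skips saving
        fig_name="estimated_policy_value.png",
    )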