Quickstart

We show an example of conducting offline evaluation of the performance of Bernoulli Thompson Sampling (BernoulliTS) as an evaluation policy, using Inverse Probability Weighting (IPW) as the estimator and logged bandit feedback generated by the Random policy (the behavior policy). Only about ten lines of code are needed to complete OPE from scratch. This example assumes that the obd/random/all directory exists under the present working directory, so please clone the repository in advance.

# a case for implementing OPE of the BernoulliTS policy using log data generated by the Random policy
>>> from obp.dataset import OpenBanditDataset
>>> from obp.policy import BernoulliTS
>>> from obp.ope import OffPolicyEvaluation, InverseProbabilityWeighting as IPW

# (1) Data loading and preprocessing
>>> dataset = OpenBanditDataset(behavior_policy='random', campaign='all')
>>> bandit_feedback = dataset.obtain_batch_bandit_feedback()

# (2) Off-Policy Learning
>>> evaluation_policy = BernoulliTS(
    n_actions=dataset.n_actions,
    len_list=dataset.len_list,
    is_zozotown_prior=True,
    campaign="all",
    random_state=12345
)
>>> action_dist = evaluation_policy.compute_batch_action_dist(
    n_sim=100000, n_rounds=bandit_feedback["n_rounds"]
)

# (3) Off-Policy Evaluation
>>> ope = OffPolicyEvaluation(bandit_feedback=bandit_feedback, ope_estimators=[IPW()])
>>> estimated_policy_value = ope.estimate_policy_values(action_dist=action_dist)

# estimated performance of BernoulliTS relative to the ground-truth performance of Random
>>> relative_policy_value_of_bernoulli_ts = estimated_policy_value['ipw'] / bandit_feedback['reward'].mean()
>>> print(relative_policy_value_of_bernoulli_ts)
1.198126...

A detailed introduction with the same example can be found at quickstart. Below, we explain some important features in the example flow.

Data loading and preprocessing

We prepare an easy-to-use data loader for Open Bandit Dataset.

# load and preprocess raw data in "ALL" campaign collected by the Random policy
>>> dataset = OpenBanditDataset(behavior_policy='random', campaign='all')
# obtain logged bandit feedback generated by the behavior policy
>>> bandit_feedback = dataset.obtain_batch_bandit_feedback()

>>> print(bandit_feedback.keys())
dict_keys(['n_rounds', 'n_actions', 'action', 'position', 'reward', 'pscore', 'context', 'action_context'])
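
Each entry of bandit_feedback is either a scalar or a NumPy array keyed by the names above. A quick inspection like the following (a sketch; the shape comments describe the general layout rather than exact sizes) shows how the log data are organized.

# inspect the logged bandit feedback
>>> print(bandit_feedback["n_rounds"], bandit_feedback["n_actions"])
>>> print(bandit_feedback["action"].shape)   # (n_rounds,): actions chosen by the behavior policy
>>> print(bandit_feedback["reward"].shape)   # (n_rounds,): observed rewards
>>> print(bandit_feedback["pscore"].shape)   # (n_rounds,): action choice probabilities of the behavior policy
>>> print(bandit_feedback["context"].shape)  # (n_rounds, dim_context): user context features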

Users can implement their own feature engineering in the pre_process method of the obp.dataset.OpenBanditDataset class. We show an example of implementing some new feature engineering processes in custom_dataset.py.
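
As an illustration, here is a minimal sketch of that pattern, assuming the default pre_process sets self.context; the appended constant feature is purely hypothetical, so replace it with your own feature engineering.

# a minimal sketch of custom feature engineering via the pre_process method
>>> import numpy as np
>>> from obp.dataset import OpenBanditDataset

>>> class CustomOpenBanditDataset(OpenBanditDataset):
        def pre_process(self) -> None:
            # run the default preprocessing first
            super().pre_process()
            # then append an extra (here, constant) feature column to the context
            self.context = np.c_[self.context, np.ones(self.context.shape[0])]

>>> dataset = CustomOpenBanditDataset(behavior_policy='random', campaign='all')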

Moreover, by following the interface of the obp.dataset.BaseBanditDataset class, one can handle their own datasets or future open datasets for bandit algorithms other than our OBD.
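
For reference, below is a minimal sketch of such a wrapper. It assumes that the only method required by BaseBanditDataset is obtain_batch_bandit_feedback and that it should return a dictionary with the same keys as above; please check the base class in your installed obp version for the exact interface.

# a minimal sketch of wrapping your own log data (field names are illustrative)
>>> import numpy as np
>>> from obp.dataset import BaseBanditDataset

>>> class MyBanditDataset(BaseBanditDataset):
        def __init__(self, actions, rewards, contexts, pscores, n_actions):
            self.actions, self.rewards = actions, rewards
            self.contexts, self.pscores = contexts, pscores
            self.n_actions = n_actions

        def obtain_batch_bandit_feedback(self) -> dict:
            # return the keys expected by the OPE modules
            return dict(
                n_rounds=self.actions.shape[0],
                n_actions=self.n_actions,
                action=self.actions,
                position=np.zeros_like(self.actions),  # single-slot setting
                reward=self.rewards,
                pscore=self.pscores,
                context=self.contexts,
                action_context=np.eye(self.n_actions),  # one-hot action features
            )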

Off-Policy Learning

After preparing the dataset, we compute the action choice probabilities of BernoulliTS as used in the ZOZOTOWN production and use them as the evaluation policy.

# define evaluation policy (the Bernoulli TS policy here)
# by activating the `is_zozotown_prior` argument of BernoulliTS, we can replicate BernoulliTS used in ZOZOTOWN production.
>>> evaluation_policy = BernoulliTS(
    n_actions=dataset.n_actions,
    len_list=dataset.len_list,
    is_zozotown_prior=True, # replicate the policy in the ZOZOTOWN production
    campaign="all",
    random_state=12345
)
# compute the distribution over actions by the evaluation policy using Monte Carlo simulation
# action_dist is an array of shape (n_rounds, n_actions, len_list)
# representing the distribution over actions made by the evaluation policy
>>> action_dist = evaluation_policy.compute_batch_action_dist(
    n_sim=100000, n_rounds=bandit_feedback["n_rounds"]
)

The compute_batch_action_dist method of BernoulliTS computes the action choice probabilities based on the given hyperparameters of the beta distribution. action_dist is an array of shape (n_rounds, n_actions, len_list) representing the distribution over actions made by the evaluation policy.
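
Since action_dist is a distribution over actions at each recommendation position, a quick check like the following (outputs shown are illustrative) confirms its shape and that the probabilities at each position sum to approximately one.

# sanity check of action_dist
>>> print(action_dist.shape)           # (n_rounds, n_actions, len_list)
>>> print(action_dist[0].sum(axis=0))  # sums over actions per position, e.g., array([1., 1., 1.])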

Off-Policy Evaluation

Our final step is off-policy evaluation (OPE), which attempts to estimate the performance of a decision-making policy using only the log data generated by the behavior policy. Our pipeline provides an easy procedure for doing OPE as follows.

# estimate the policy value of BernoulliTS based on the distribution over actions by that policy
# it is possible to set multiple OPE estimators to the `ope_estimators` argument
>>> ope = OffPolicyEvaluation(bandit_feedback=bandit_feedback, ope_estimators=[IPW()])
>>> estimated_policy_value = ope.estimate_policy_values(action_dist=action_dist)
>>> print(estimated_policy_value)
{'ipw': 0.004553...} # dictionary containing estimated policy values by each OPE estimator.

# compare the estimated performance of BernoulliTS (evaluation policy)
# with the ground-truth performance of Random (behavior policy)
>>> relative_policy_value_of_bernoulli_ts = estimated_policy_value['ipw'] / bandit_feedback['reward'].mean()
# our OPE procedure suggests that BernoulliTS improves Random by 19.81%
>>> print(relative_policy_value_of_bernoulli_ts)
1.198126...

Users can implement their own OPE estimator by following the interface of the obp.ope.BaseOffPolicyEstimator class. The obp.ope.OffPolicyEvaluation class summarizes and compares the policy values estimated by several OPE estimators; a detailed usage of this class can be found at quickstart. bandit_feedback['reward'].mean() is the empirical mean of the factual rewards in the log (the on-policy estimate of the policy value) and is thus the ground-truth performance of the behavior policy (the Random policy in this example).
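
As a concrete illustration, below is a minimal sketch of a custom estimator (a self-normalized variant of IPW, written here purely for illustration). The set of methods required by BaseOffPolicyEstimator and the keyword arguments passed by OffPolicyEvaluation can differ across obp versions, so treat the method names and signatures below as assumptions and check the base class in your installed version.

# a minimal sketch of a custom OPE estimator (method names/signatures are assumptions)
>>> import numpy as np
>>> from obp.ope import BaseOffPolicyEstimator

>>> class MySelfNormalizedIPW(BaseOffPolicyEstimator):
        estimator_name = "my_snipw"

        def _estimate_round_rewards(self, reward, action, position, pscore, action_dist, **kwargs):
            # importance weight of each logged round under the evaluation policy
            iw = action_dist[np.arange(action.shape[0]), action, position] / pscore
            return reward * iw / iw.mean()  # self-normalization

        def estimate_policy_value(self, reward, action, position, pscore, action_dist, **kwargs):
            return self._estimate_round_rewards(
                reward=reward, action=action, position=position,
                pscore=pscore, action_dist=action_dist,
            ).mean()

        def estimate_interval(self, *args, **kwargs):
            raise NotImplementedError  # confidence intervals are omitted in this sketch

# a custom estimator can then be compared with the built-in ones in a single call
>>> ope = OffPolicyEvaluation(
    bandit_feedback=bandit_feedback,
    ope_estimators=[IPW(), MySelfNormalizedIPW()],
)
>>> ope.estimate_policy_values(action_dist=action_dist)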