_images/logo.png

Open Bandit Pipeline: a Python library for bandit algorithms and off-policy evaluation

Overview

Open Bandit Pipeline (OBP) is an open-source Python library for bandit algorithms and off-policy evaluation (OPE). The toolkit comes with the Open Bandit Dataset, a large-scale logged bandit feedback dataset collected on a fashion e-commerce platform, ZOZOTOWN. The purpose of the open data and library is to enable easy, realistic, and reproducible evaluation of bandit algorithms and OPE. OBP provides implementations of dataset preprocessing, bandit policy interfaces, and a variety of OPE estimators.

Our open data and pipeline facilitate evaluation and comparison related to the following research topics.

  • Bandit Algorithms: Our data include the probabilities of each action being selected by behavior policies (the true propensity scores). Therefore, the data enable the evaluation of new online bandit algorithms, including contextual and combinatorial algorithms, in a large, real-world setting.

  • Off-Policy Evaluation: We present implementations of the behavior policies used when collecting the dataset as a part of our pipeline. Our open data also contain logged bandit feedback generated by multiple behavior policies. Therefore, the data enable the evaluation of off-policy evaluation with ground truths for the performances of evaluation policies.

This website contains pages with example analyses to help demonstrate the usage of this library. Additionally, it presents examples of evaluating counterfactual bandit algorithms and OPE itself. The reference page contains the full reference documentation for the current functions of this toolkit.

Algorithms and OPE Estimators Supported

Bandit Algorithms

  • Online

    • Context-free

      • Random

      • Epsilon Greedy

      • Bernoulli Thompson Sampling

    • Contextual (Linear)

      • Linear Epsilon Greedy

      • Linear Thompson Sampling [10]

      • Linear Upper Confidence Bound [11]

    • Contextual (Logistic)

      • Logistic Epsilon Greedy

      • Logistic Thompson Sampling [12]

      • Logistic Upper Confidence Bound [13]

  • Offline (Off-Policy Learning) [4]

    • Inverse Probability Weighting

OPE Estimators

  • Replay Method (RM) [14]

  • Direct Method (DM) [15]

  • Inverse Probability Weighting (IPW) [2] [3]

  • Self-Normalized Inverse Probability Weighting (SNIPW) [16]

  • Doubly Robust (DR) [4]

  • Switch Estimators [8]

  • Doubly Robust with Optimistic Shrinkage (DRos) [9]

  • More Robust Doubly Robust (MRDR) [1]

  • Double Machine Learning (DML) [17]
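
The methods listed above correspond to classes in the policy and ope modules. The import sketch below illustrates this mapping; the class names shown are assumptions based on the package's naming conventions, so please consult the package reference for the exact API.

# a minimal sketch of how the supported methods map onto obp classes;
# the class names here are assumptions, see the package reference for the exact API
from obp.policy import BernoulliTS, EpsilonGreedy, LinUCB, LogisticTS, Random, IPWLearner
from obp.ope import (
    ReplayMethod,
    DirectMethod,
    InverseProbabilityWeighting,
    SelfNormalizedInverseProbabilityWeighting,
    DoublyRobust,
)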

Citation

If you use our dataset and pipeline in your work, please cite our paper below.

@article{saito2020large,
  title={Large-scale Open Dataset, Pipeline, and Benchmark for Bandit Algorithms},
  author={Saito, Yuta and Aihara, Shunsuke and Matsutani, Megumi and Narita, Yusuke},
  journal={arXiv preprint arXiv:2008.07146},
  year={2020}
}

Google Group

If you are interested in the Open Bandit Project, you can follow its updates at the Google Group: https://groups.google.com/g/open-bandit-project

Contact

For any questions about the paper, data, or pipeline, feel free to contact: saito@hanjuku-kaso.com

Table of Contents

About

Motivated by the paucity of real-world data and implementations enabling the evaluation and comparison of OPE, we release the following open-source dataset and pipeline software for research use.

Open Bandit Dataset (OBD)

Open Bandit Dataset is a public, real-world logged bandit feedback dataset. The dataset is provided by ZOZO, Inc., the largest Japanese fashion e-commerce company with over 5 billion USD market capitalization (as of May 2020). The company uses multi-armed bandit algorithms to recommend fashion items to users on its large-scale fashion e-commerce platform, ZOZOTOWN. The following figure presents examples of displayed fashion items as actions.

_images/recommended_fashion_items.png

We collected the data in a 7-day experiment in late November 2019 on three campaigns, corresponding to “all”, “men’s”, and “women’s” items, respectively. Each campaign randomly used either the Random policy or the Bernoulli Thompson Sampling (Bernoulli TS) policy for each user impression. Note that we pre-trained Bernoulli TS for over a month before the data collection process, so the policy had converged to an essentially fixed one. Thus, we suppose our data are generated by a fixed policy and apply the standard OPE formulation that assumes static behavior and evaluation policies. These policies select three of the possible fashion items for each user. Let \(\mathcal{I}:=\{0,\ldots,n\}\) be a set of \(n+1\) items and \(\mathcal{K}:=\{0,\ldots,k\}\) be a set of \(k+1\) positions. The above figure shows that \(k+1=3\) for our data. We assume that the reward (click indicator) depends only on the item and its position, which is a common assumption on the click generative model in the web industry (Li et al., 2018). Under this assumption, the action space is simply the product of the item set and the position set, i.e., \(\calA = \mathcal{I} \times \mathcal{K}\). Then, we can apply the standard OPE setup and estimators to our setting. We describe some statistics of the dataset in the following.

_images/statistics_of_obd.png

The data is large and contains many millions of recommendation instances. It also includes the true action choice probabilities by behavior policies computed by Monte Carlo simulations based on the policy parameters (e.g., parameters of the beta distribution used by Bernoulli TS) used during the data collection process. The number of actions is also sizable, so this setting is challenging for bandit algorithms and their OPE. We share the full version of our data at https://research.zozo.com/data.html

Open Bandit Pipeline (OBP)

Open Bandit Pipeline is a series of implementations of dataset preprocessing, policy learning, and evaluation of OPE estimators. This pipeline allows researchers to focus on building their bandit algorithm or OPE estimator and easily compare them with others’ methods in realistic and reproducible ways. Thus, it facilitates reproducible research on bandit algorithms and off-policy evaluation.

_images/overview.png

Open Bandit Pipeline consists of the following main modules.

  • dataset module: This module provides a data loader for Open Bandit Dataset and a flexible interface for handling logged bandit feedback. It also provides tools to generate synthetic bandit datasets.

  • policy module: This module provides interfaces for online and offline bandit algorithms. It also implements several standard algorithms.

  • simulator module: This module provides functions for conducting offline bandit simulation.

  • ope module: This module provides interfaces for OPE estimators. It also implements several standard OPE estimators.

In addition to the above algorithms and estimators, the pipeline also provides flexible interfaces. Therefore, researchers can easily implement their own algorithms or estimators and evaluate them with our data and pipeline. Moreover, the pipeline provides an interface for handling logged bandit feedback datasets. Thus, practitioners can combine their own datasets with the pipeline and easily evaluate bandit algorithms’ performances in their settings.

Please see package reference for detailed information about Open Bandit Pipeline.

To our knowledge, our real-world dataset and pipeline are the first to include logged bandit datasets collected by running multiple different policies, policy implementations used in production, and their ground-truth policy values. These features enable the evaluation of OPE for the first time.

Overview

Setup

We consider a general contextual bandit setting. Let \(r \in [0, R_{\mathrm{max}}]\) denote a reward or outcome variable (e.g., whether a fashion item as an action results in a click). We let \(x \in \calX\) be a context vector (e.g., the user’s demographic profile) that the decision maker observes when picking an action. Rewards and contexts are sampled from the unknown probability distributions \(p (r \mid x, a)\) and \(p(x)\), respectively. Let \(\calA:=\{0,\ldots,m\}\) be a finite set of \(m+1\) actions. We call a function \(\pi: \calX \rightarrow \Delta(\calA)\) a policy. It maps each context \(x \in \calX\) into a distribution over actions, where \(\pi (a \mid x)\) is the probability of taking action \(a\) given \(x\).

Let \(\calD := \{(x_t,a_t,r_t)\}_{t=1}^{T}\) be \(T\) rounds of observations. \(a_t\) is a discrete variable indicating which action in \(\calA\) is chosen in round \(t\). \(r_t\) and \(x_t\) denote the reward and the context observed in round \(t\), respectively. We assume that a logged bandit feedback is generated by a behavior policy \(\pi_b\) as follows:

\[\{(x_t,a_t,r_t)\}_{t=1}^{T} \sim \prod_{t=1}^{T} p(x_t) \pi_b (a_t \mid x_t) p(r_t \mid x_t, a_t),\]

where each context-action-reward triplet is sampled independently from the product distribution. Note that we assume \(a_t\) is independent of \(r_t\) conditional on \(x_t\).

We let \(\pi(x,a,r) := p(x) \pi (a \mid x) p(r \mid x, a)\) be the product distribution by a policy \(\pi\). For a function \(f(x,a,r)\), we use \(\E_{\calD} [f] := |\calD|^{-1} \sum_{(x_t, a_t, r_t) \in \calD} f(x_t, a_t, r_t)\) to denote its empirical expectation over \(T\) observations in \(\calD\). Then, for a function \(g(x,a)\), we let \(g(x,\pi) := \E_{a \sim \pi(a|x)}[g(x,a) \mid x]\). We also use \(q(x,a) := \E_{r \sim p(r|x,a)} [ r \mid x, a ]\) to denote the mean reward function.
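
As a concrete illustration of this notation, the following NumPy sketch (ours, not part of the library) computes the empirical expectation \(\E_{\calD}[f]\) and \(\hat{q}(x_t, \pi)\) from hypothetical arrays representing a logged dataset.

import numpy as np

# hypothetical arrays representing D = {(x_t, a_t, r_t)}_{t=1}^T
n_rounds, n_actions = 1000, 10
rng = np.random.default_rng(12345)
context = rng.normal(size=(n_rounds, 5))           # x_t
action = rng.integers(n_actions, size=n_rounds)    # a_t
reward = rng.binomial(1, 0.05, size=n_rounds)      # r_t

# the empirical expectation E_D[f] is the sample mean over the T observations
empirical_mean_reward = reward.mean()

# for a reward model q_hat(x, a) and a policy pi(a|x), both stored as
# (n_rounds, n_actions) arrays, q_hat(x_t, pi) = sum_a pi(a|x_t) * q_hat(x_t, a)
q_hat = rng.uniform(size=(n_rounds, n_actions))
pi = np.full((n_rounds, n_actions), 1.0 / n_actions)  # e.g., a uniform random policy
q_hat_of_pi = (pi * q_hat).sum(axis=1)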

Estimation Target

We are interested in using the historical logged bandit data to estimate the following policy value of any given evaluation policy \(\pi_e\) which might be different from \(\pi_b\):

\[V (\pi_e) := \E_{(x,a,r) \sim \pi_e (x,a,r)} [r] .\]

We allow the evaluation policy \(\pi_e\) to be degenerate, i.e., it may choose a particular action with probability 1. Estimating \(V(\pi_e)\) before implementing \(\pi_e\) in an online environment is valuable because \(\pi_e\) may perform poorly and damage user satisfaction. Additionally, it is possible to select the evaluation policy that maximizes the policy value by comparing estimated performances, without incurring additional implementation cost.

Estimators

Direct Method (DM)

A widely-used method, DM, first learns a supervised machine learning model, such as random forest, ridge regression, and gradient boosting, to estimate the mean reward function. DM then uses it to estimate the policy value as

\[\hat{V}_{\mathrm{DM}} (\pi_e; \calD, \hat{q}) := \E_{\calD} [ \hat{q} (x_t, \pi_e) ],\]

where \(\hat{q}(x, a)\) is the estimated mean reward function. If \(\hat{q}(x, a)\) is a good approximation to the mean reward function, this estimator accurately estimates the policy value of the evaluation policy, \(V(\pi_e)\). If \(\hat{q}(x, a)\) fails to approximate the mean reward function well, however, the final estimator is no longer consistent. This model misspecification issue is problematic because the extent of misspecification cannot be easily quantified from data [1].
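
The following NumPy snippet is a minimal sketch of the DM formula above (it is not obp's implementation); q_hat and pi_e are hypothetical arrays of estimated mean rewards and evaluation-policy action choice probabilities.

import numpy as np

# minimal sketch of DM, assuming q_hat and pi_e are both of shape (n_rounds, n_actions)
def direct_method(q_hat: np.ndarray, pi_e: np.ndarray) -> float:
    # q_hat(x_t, pi_e) = sum_a pi_e(a|x_t) * q_hat(x_t, a), averaged over rounds
    return float(np.mean((pi_e * q_hat).sum(axis=1)))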

Inverse Probability Weighting (IPW)

To alleviate the issue with DM, researchers often use another estimator called IPW [2] [3]. IPW re-weights the rewards by the ratio of the evaluation policy and behavior policy as

\[\hat{V}_{\mathrm{IPW}} (\pi_e; \calD) := \E_{\calD} [w(x_t,a_t) r_t ],\]

where \(w(x,a) := \pi_e(a \mid x) / \pi_b(a \mid x)\) is the importance weight given \(x\) and \(a\). When the behavior policy is known, the IPW estimator is unbiased and consistent for the policy value. However, it can have a large variance, especially when the evaluation policy significantly deviates from the behavior policy.
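
The next snippet is an analogous sketch of IPW (again not obp's implementation), assuming the importance weights have already been computed from the two policies.

import numpy as np

# minimal sketch of IPW; reward and iw are (n_rounds,) arrays, where
# iw[t] = w(x_t, a_t) = pi_e(a_t|x_t) / pi_b(a_t|x_t)
def inverse_probability_weighting(reward: np.ndarray, iw: np.ndarray) -> float:
    return float(np.mean(iw * reward))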

Doubly Robust (DR)

The final approach is DR [4], which combines the above two estimators as

\[\hat{V}_{\mathrm{DR}} (\pi_e; \calD, \hat{q}) := \E_{\calD} [ \hat{q} (x_t, \pi_e) + w(x_t,a_t) (r_t-\hat{q}(x_t, a_t) ) ].\]

DR mimics IPW to use a weighted version of rewards, but DR also uses the estimated mean reward function as a control variate to decrease the variance. It preserves the consistency of IPW if either the importance weight or the mean reward estimator is accurate (a property called double robustness). Moreover, DR is semiparametric efficient [5] when the mean reward estimator is correctly specified. On the other hand, when it is wrong, this estimator can have larger asymptotic mean-squared-error than IPW [6] and perform poorly in practice [7].
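
A minimal sketch of the DR formula, under the same hypothetical array conventions as the DM and IPW sketches above:

import numpy as np

# minimal sketch of DR; q_hat_factual[t] = q_hat(x_t, a_t) for the logged action,
# q_hat_pi_e[t] = q_hat(x_t, pi_e) = sum_a pi_e(a|x_t) * q_hat(x_t, a)
def doubly_robust(reward, iw, q_hat_factual, q_hat_pi_e) -> float:
    return float(np.mean(q_hat_pi_e + iw * (reward - q_hat_factual)))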

Self-Normalized Estimators

Self-Normalized Inverse Probability Weighting (SNIPW) is an approach to address the variance issue with the original IPW. It estimates the policy value by dividing the sum of weighted rewards by the sum of importance weights as:

\[\hat{V}_{\mathrm{SNIPW}} (\pi_e; \calD) :=\frac{\E_{\calD} [ w(x_t,a_t) r_t ]}{\E_{\calD} [ w(x_t,a_t) ]}.\]

SNIPW is more stable than IPW, because the policy value estimated by SNIPW is bounded in the support of rewards and its conditional variance given action and context is bounded by the conditional variance of the rewards [6]. IPW does not have these properties. We can define Self-Normalized Doubly Robust (SNDR) in a similar manner as follows.

\[\hat{V}_{\mathrm{SNDR}} (\pi_e; \calD) := \E_{\calD} \left[\hat{q}(x_t, \pi_e) + \frac{w(x_t,a_t) (r_t-\hat{q}(x_t, a_t) )}{\E_{\calD} [ w(x_t,a_t) ]} \right].\]
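
Minimal sketches of SNIPW and SNDR under the same conventions (not obp's implementation):

import numpy as np

# minimal sketches of the self-normalized estimators
def snipw(reward, iw) -> float:
    # divide the sum of weighted rewards by the sum of importance weights
    return float(np.sum(iw * reward) / np.sum(iw))

def sndr(reward, iw, q_hat_factual, q_hat_pi_e) -> float:
    return float(np.mean(q_hat_pi_e + iw * (reward - q_hat_factual) / np.mean(iw)))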

Switch Estimators

The DR estimator can still be subject to the variance issue, particularly when the importance weights are large due to low overlap. Switch-DR aims to reduce the effect of the variance issue by using DM where importance weights are large as:

\[\hat{V}_{\mathrm{SwitchDR}} (\pi_e; \calD, \hat{q}, \tau) := \E_{\calD} \left[ \hat{q}(x_t, \pi_e) + w(x_t,a_t) (r_t-\hat{q}(x_t, a_t) ) \mathbb{I}\{ w(x_t,a_t) \le \tau \} \right],\]

where \(\mathbb{I} \{\cdot\}\) is the indicator function and \(\tau \ge 0\) is a hyperparameter. Switch-DR interpolates between DM and DR. When \(\tau=0\), it coincides with DM, while \(\tau \to \infty\) yields DR. This estimator is minimax optimal when \(\tau\) is appropriately chosen [8].
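
A minimal sketch of Switch-DR under the same conventions, with the switching hyperparameter tau:

import numpy as np

# minimal sketch of Switch-DR; falls back to DM wherever the importance weight exceeds tau
def switch_dr(reward, iw, q_hat_factual, q_hat_pi_e, tau: float) -> float:
    use_dr = (iw <= tau).astype(float)  # indicator I{w(x_t, a_t) <= tau}
    return float(np.mean(q_hat_pi_e + use_dr * iw * (reward - q_hat_factual)))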

More Robust Doubly Robust (MRDR)

MRDR uses a specialized reward estimator (\(\hat{q}_{\mathrm{MRDR}}\)) that minimizes the variance of the resulting policy value estimator [1]. This estimator estimates the policy value as:

\[\hat{V}_{\mathrm{MRDR}} (\pi_e; \calD, \hat{q}_{\mathrm{MRDR}}) := \hat{V}_{\mathrm{DR}} (\pi_e; \calD, \hat{q}_{\mathrm{MRDR}}),\]

where \(\hat{q}_{\mathrm{MRDR}}\) is learned within a function class \(\mathcal{Q}\) by minimizing the variance of the resulting DR estimator. When \(\mathcal{Q}\) is well-specified, then \(\hat{q}_{\mathrm{MRDR}} = q\). Even if \(\mathcal{Q}\) is misspecified, the derived reward estimator is expected to behave well, since the target function is the resulting variance.

Doubly Robust with Optimistic Shrinkage (DRos)

[9] proposes DRos based on a new weight function \(w_o: \calX \times \calA \rightarrow \mathbb{R}_{+}\) that directly minimizes sharp bounds on the MSE of the resulting estimator. DRos is defined as

\[\hat{V}_{\mathrm{DRos}} (\pi_e; \calD, \hat{q}, \lambda) := \E_{\calD} [ \hat{q} (x_t, \pi_e) + w_o (x_t, a_t; \lambda) (r_t-\hat{q}(x_t, a_t) ) ],\]

where \(\lambda \ge 0\) is a hyperparameter and the new weight is

\[w_o (x, a; \lambda) := \frac{\lambda}{w^{2}(x, a)+\lambda} w(x, a).\]

When \(\lambda = 0\), \(w_o (x, a; \lambda) = 0\), so DRos coincides with the standard DM. On the other hand, as \(\lambda \rightarrow \infty\), \(w_o (x, a; \lambda) \rightarrow w(x,a)\), recovering the original DR.
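
A minimal sketch of DRos under the same conventions, with the shrinkage hyperparameter lambda (assumed positive here so the shrunk weight is well defined):

import numpy as np

# minimal sketch of DRos; lam > 0 is assumed so that the denominator is nonzero
def dr_with_optimistic_shrinkage(reward, iw, q_hat_factual, q_hat_pi_e, lam: float) -> float:
    w_o = lam * iw / (iw ** 2 + lam)  # shrunk importance weight
    return float(np.mean(q_hat_pi_e + w_o * (reward - q_hat_factual)))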

Evaluation of OPE

Here we describe an experimental protocol to evaluate OPE estimators and use it to compare a wide variety of existing estimators.

We can empirically evaluate OPE estimators’ performances by using two sources of logged bandit feedback collected by two different policies \(\pi^{(he)}\) (hypothetical evaluation policy) and \(\pi^{(hb)}\) (hypothetical behavior policy). We denote log data generated by \(\pi^{(he)}\) and \(\pi^{(hb)}\) as \(\calD^{(he)} := \{ (x^{(he)}_t, a^{(he)}_t, r^{(he)}_t) \}_{t=1}^T\) and \(\calD^{(hb)} := \{ (x^{(hb)}_t, a^{(hb)}_t, r^{(hb)}_t) \}_{t=1}^T\), respectively. By applying the following protocol to several different OPE estimators, we can compare their estimation performances:

  1. Define the evaluation and test sets as:

    • in-sample case: \(\calD_{\mathrm{ev}} := \calD^{(hb)}_{1:T}\), \(\calD_{\mathrm{te}} := \calD^{(he)}_{1:T}\)

    • out-sample case: \(\calD_{\mathrm{ev}} := \calD^{(hb)}_{1:\tilde{t}}\), \(\calD_{\mathrm{te}} := \calD^{(he)}_{\tilde{t}+1:T}\)

    where \(\calD_{a:b} := \{ (x_t,a_t,r_t) \}_{t=a}^{b}\).

  2. Estimate the policy value of \(\pi^{(he)}\) using \(\calD_{\mathrm{ev}}\) by an estimator \(\hat{V}\). We can represent an estimated policy value by \(\hat{V}\) as \(\hat{V} (\pi^{(he)}; \calD_{\mathrm{ev}})\).

  3. Estimate \(V(\pi^{(he)})\) by the on-policy estimation and regard it as the ground-truth as

    \[V_{\mathrm{on}} (\pi^{(he)}; \calD_{\mathrm{te}}) := \E_{\calD_{\mathrm{te}}} [r^{(he)}_t].\]
  4. Compare the off-policy estimate \(\hat{V}(\pi^{(he)}; \calD_{\mathrm{ev}})\) with its ground-truth \(V_{\mathrm{on}} (\pi^{(he)}; \calD_{\mathrm{te}})\). We can evaluate the estimation accuracy of \(\hat{V}\) by the following relative estimation error (relative-EE):

    \[\textit{relative-EE} (\hat{V}; \calD_{\mathrm{ev}}) := \left| \frac{\hat{V} (\pi^{(he)}; \calD_{\mathrm{ev}}) - V_{\mathrm{on}} (\pi^{(he)}; \calD_{\mathrm{te}}) }{V_{\mathrm{on}} (\pi^{(he)}; \calD_{\mathrm{te}})} \right|.\]
  5. To estimate the standard deviation of relative-EE, repeat the above process several times with different bootstrap samples of the logged bandit data, created by sampling data with replacement from \(\calD_{\mathrm{ev}}\).

We call the problem setting without sample splitting over time the in-sample case. In contrast, we call the setting with sample splitting the out-sample case, where OPE estimators aim to estimate the policy value of the evaluation policy in the test data.
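
The relative-EE defined in step 4 above reduces to a one-line computation; the sketch below assumes the off-policy estimate and the on-policy ground truth have already been obtained.

# minimal sketch of relative-EE from step 4 of the protocol
def relative_estimation_error(v_hat: float, v_on: float) -> float:
    # |(V_hat - V_on) / V_on|, where v_on is the on-policy (ground-truth) estimate
    return abs((v_hat - v_on) / v_on)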

The following algorithm describes the detailed experimental protocol to evaluate OPE estimators.

_images/evaluation_of_ope_algo.png

Using the above protocol, our real-world data, and pipeline, we have performed extensive benchmark experiments on a variety of existing off-policy estimators. The experimental results and the relevant discussion can be found in our paper. The code for running the benchmark experiments can be found at zr-obp/benchmark/ope.

Installation

obp is available on PyPI and can be installed via pip or from source as follows:

From pip:

pip install obp

From source:

git clone https://github.com/st-tech/zr-obp
cd zr-obp
python setup.py install

Quickstart

We show an example of conducting offline evaluation of the performance of Bernoulli Thompson Sampling (BernoulliTS) as an evaluation policy using Inverse Probability Weighting (IPW) and logged bandit feedback generated by the Random policy (behavior policy). We see that only ten lines of code are sufficient to complete OPE from scratch. In this example, it is assumed that the obd/random/all directory exists under the present working directory. Please clone the repository in advance.

# a case for implementing OPE of the BernoulliTS policy using log data generated by the Random policy
>>> from obp.dataset import OpenBanditDataset
>>> from obp.policy import BernoulliTS
>>> from obp.ope import OffPolicyEvaluation, InverseProbabilityWeighting as IPW

# (1) Data loading and preprocessing
>>> dataset = OpenBanditDataset(behavior_policy='random', campaign='all')
>>> bandit_feedback = dataset.obtain_batch_bandit_feedback()

# (2) Off-Policy Learning
>>> evaluation_policy = BernoulliTS(
    n_actions=dataset.n_actions,
    len_list=dataset.len_list,
    is_zozotown_prior=True,
    campaign="all",
    random_state=12345
)
>>> action_dist = evaluation_policy.compute_batch_action_dist(
    n_sim=100000, n_rounds=bandit_feedback["n_rounds"]
)

# (3) Off-Policy Evaluation
>>> ope = OffPolicyEvaluation(bandit_feedback=bandit_feedback, ope_estimators=[IPW()])
>>> estimated_policy_value = ope.estimate_policy_values(action_dist=action_dist)

# estimated performance of BernoulliTS relative to the ground-truth performance of Random
>>> relative_policy_value_of_bernoulli_ts = estimated_policy_value['ipw'] / bandit_feedback['reward'].mean()
>>> print(relative_policy_value_of_bernoulli_ts)
1.198126...

A detailed introduction with the same example can be found at quickstart. Below, we explain some important features in the example flow.

Data loading and preprocessing

We prepare an easy-to-use data loader for Open Bandit Dataset.

# load and preprocess raw data in "ALL" campaign collected by the Random policy
>>> dataset = OpenBanditDataset(behavior_policy='random', campaign='all')
# obtain logged bandit feedback generated by the behavior policy
>>> bandit_feedback = dataset.obtain_batch_bandit_feedback()

>>> print(bandit_feedback.keys())
dict_keys(['n_rounds', 'n_actions', 'action', 'position', 'reward', 'pscore', 'context', 'action_context'])

Users can implement their own feature engineering in the pre_process method of obp.dataset.OpenBanditDataset class. We show an example of implementing some new feature engineering processes in custom_dataset.py.

Moreover, by following the interface of the obp.dataset.BaseBanditDataset class, one can handle their own datasets or future open bandit datasets other than our OBD.
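
For illustration, a user-defined dataset class might look like the sketch below. It assumes only that BaseBanditDataset requires an obtain_batch_bandit_feedback method returning a dictionary with keys like those shown above; please check obp.dataset.base for the exact interface.

from obp.dataset import BaseBanditDataset

# a hypothetical custom dataset; the only assumption is that the interface requires
# an `obtain_batch_bandit_feedback` method returning a dictionary of logged feedback
class MyBanditDataset(BaseBanditDataset):
    def __init__(self, context, action, reward, pscore, position, action_context):
        self.context = context
        self.action = action
        self.reward = reward
        self.pscore = pscore
        self.position = position
        self.action_context = action_context

    def obtain_batch_bandit_feedback(self) -> dict:
        return dict(
            n_rounds=self.context.shape[0],
            n_actions=self.action_context.shape[0],
            context=self.context,
            action=self.action,
            reward=self.reward,
            pscore=self.pscore,
            position=self.position,
            action_context=self.action_context,
        )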

Off-Policy Learning

After preparing a dataset, we now compute the action choice probabilities of the BernoulliTS policy used in the ZOZOTOWN production. Then, we can use it as the evaluation policy.

# define evaluation policy (the Bernoulli TS policy here)
# by activating the `is_zozotown_prior` argument of BernoulliTS, we can replicate BernoulliTS used in ZOZOTOWN production.
>>> evaluation_policy = BernoulliTS(
    n_actions=dataset.n_actions,
    len_list=dataset.len_list,
    is_zozotown_prior=True, # replicate the policy in the ZOZOTOWN production
    campaign="all",
    random_state=12345
)
# compute the distribution over actions by the evaluation policy using Monte Carlo simulation
# action_dist is an array of shape (n_rounds, n_actions, len_list)
# representing the distribution over actions made by the evaluation policy
>>> action_dist = evaluation_policy.compute_batch_action_dist(
    n_sim=100000, n_rounds=bandit_feedback["n_rounds"]
)

The compute_batch_action_dist method of BernoulliTS computes the action choice probabilities based on given hyperparameters of the beta distribution. action_dist is an array representing the distribution over actions made by the evaluation policy.

Off-Policy Evaluation

Our final step is off-policy evaluation (OPE), which attempts to estimate the performance of a decision making policy using only the log data generated by the behavior policy. Our pipeline also provides an easy procedure for doing OPE as follows.

# estimate the policy value of BernoulliTS based on the distribution over actions by that policy
# it is possible to set multiple OPE estimators to the `ope_estimators` argument
>>> ope = OffPolicyEvaluation(bandit_feedback=bandit_feedback, ope_estimators=[IPW()])
>>> estimated_policy_value = ope.estimate_policy_values(action_dist=action_dist)
>>> print(estimated_policy_value)
{'ipw': 0.004553...} # dictionary containing estimated policy values by each OPE estimator.

# compare the estimated performance of BernoulliTS (evaluation policy)
# with the ground-truth performance of Random (behavior policy)
>>> relative_policy_value_of_bernoulli_ts = estimated_policy_value['ipw'] / bandit_feedback['reward'].mean()
# our OPE procedure suggests that BernoulliTS improves Random by 19.81%
>>> print(relative_policy_value_of_bernoulli_ts)
1.198126...

Users can implement their own OPE estimator by following the interface of the obp.ope.BaseOffPolicyEstimator class. The obp.ope.OffPolicyEvaluation class summarizes and compares the estimated policy values of several off-policy estimators. A detailed usage of this class can be found at quickstart. bandit_feedback['reward'].mean() is the empirical mean of factual rewards (the on-policy estimate of the policy value) in the log and thus is the ground-truth performance of the behavior policy (the Random policy in this example).
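
As a rough illustration, a user-defined estimator might look like the sketch below. The method names (_estimate_round_rewards, estimate_policy_value, estimate_interval) and their argument names are assumptions on our part; the exact abstract interface is defined by obp.ope.BaseOffPolicyEstimator, so please consult the package reference before relying on this sketch.

import numpy as np
from obp.ope import BaseOffPolicyEstimator

# a rough sketch of a user-defined IPW-style estimator; method and argument names
# are assumptions, see obp.ope.BaseOffPolicyEstimator for the actual abstract interface
class MyInverseProbabilityWeighting(BaseOffPolicyEstimator):
    estimator_name: str = "my_ipw"

    def _estimate_round_rewards(self, reward, action, position, pscore, action_dist, **kwargs):
        # importance weight w(x_t, a_t) = pi_e(a_t|x_t) / pi_b(a_t|x_t),
        # where action_dist has shape (n_rounds, n_actions, len_list)
        iw = action_dist[np.arange(reward.shape[0]), action, position] / pscore
        return iw * reward

    def estimate_policy_value(self, reward, action, position, pscore, action_dist, **kwargs):
        return self._estimate_round_rewards(
            reward=reward, action=action, position=position,
            pscore=pscore, action_dist=action_dist,
        ).mean()

    def estimate_interval(self, reward, action, position, pscore, action_dist, alpha=0.05, **kwargs):
        # a simple normal-approximation confidence interval as a placeholder
        rr = self._estimate_round_rewards(
            reward=reward, action=action, position=position,
            pscore=pscore, action_dist=action_dist,
        )
        se = rr.std(ddof=1) / np.sqrt(rr.shape[0])
        return {"mean": rr.mean(), "lower": rr.mean() - 1.96 * se, "upper": rr.mean() + 1.96 * se}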

OBP Package Reference

ope module

obp.ope.estimators

Off-Policy Estimators.

obp.ope.meta

Off-Policy Evaluation Class to Streamline OPE.

obp.ope.regression_model

Regression Model Class for Estimating Mean Reward Functions.

policy module

obp.policy.base

Base Interfaces for Bandit Algorithms.

obp.policy.contextfree

Context-Free Bandit Algorithms.

obp.policy.linear

Contextual Linear Bandit Algorithms.

obp.policy.logistic

Contextual Logistic Bandit Algorithms.

obp.policy.offline

Offline Bandit Algorithms.

dataset module

obp.dataset.base

Abstract Base Class for Logged Bandit Feedback.

obp.dataset.real

Dataset Class for Real-World Logged Bandit Feedback.

obp.dataset.synthetic

Class for Generating Synthetic Logged Bandit Feedback.

obp.dataset.multiclass

Class for Multi-Class Classification to Bandit Reduction.

simulator module

obp.simulator.simulator

Bandit Simulator.

others

obp.utils

Useful Tools.

References

Papers

1

Mehrdad Farajtabar, Yinlam Chow, and Mohammad Ghavamzadeh. More robust doubly robust off-policy evaluation. In Proceedings of the 35th International Conference on Machine Learning, 1447–1456. 2018.

2

Doina Precup, Richard S. Sutton, and Satinder Singh. Eligibility Traces for Off-Policy Policy Evaluation. In Proceedings of the 17th International Conference on Machine Learning, 759–766. 2000.

3

Alex Strehl, John Langford, Lihong Li, and Sham M Kakade. Learning from Logged Implicit Exploration Data. In Advances in Neural Information Processing Systems, 2217–2225. 2010.

4

Miroslav Dudík, Dumitru Erhan, John Langford, and Lihong Li. Doubly Robust Policy Evaluation and Optimization. Statistical Science, 29:485–511, 2014.

5

Yusuke Narita, Shota Yasui, and Kohei Yata. Efficient counterfactual learning from bandit feedback. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, 4634–4641. 2019.

6

Nathan Kallus and Masatoshi Uehara. Intrinsically efficient, stable, and bounded off-policy evaluation for reinforcement learning. In Advances in Neural Information Processing Systems. 2019.

7

Joseph DY Kang, Joseph L Schafer, and others. Demystifying double robustness: a comparison of alternative strategies for estimating a population mean from incomplete data. Statistical science, 22(4):523–539, 2007.

8

Yu-Xiang Wang, Alekh Agarwal, and Miroslav Dudik. Optimal and Adaptive Off-policy Evaluation in Contextual Bandits. In Proceedings of the 34th International Conference on Machine Learning, 3589–3597. 2017.

9

Yi Su, Maria Dimakopoulou, Akshay Krishnamurthy, and Miroslav Dudík. Doubly robust off-policy evaluation with shrinkage. arXiv preprint arXiv:1907.09623, 2019.

10

Shipra Agrawal and Navin Goyal. Thompson sampling for contextual bandits with linear payoffs. In International Conference on Machine Learning, 127–135. 2013.

11

Lihong Li, Wei Chu, John Langford, and Robert E Schapire. A Contextual-bandit Approach to Personalized News Article Recommendation. In Proceedings of the 19th International Conference on World Wide Web, 661–670. ACM, 2010.

12

Olivier Chapelle and Lihong Li. An empirical evaluation of thompson sampling. In Advances in neural information processing systems, 2249–2257. 2011.

13

Dhruv Kumar Mahajan, Rajeev Rastogi, Charu Tiwari, and Adway Mitra. Logucb: an explore-exploit algorithm for comments recommendation. In Proceedings of the 21st ACM international conference on Information and knowledge management, 6–15. 2012.

14

Lihong Li, Wei Chu, John Langford, Taesup Moon, and Xuanhui Wang. An Unbiased Offline Evaluation of Contextual Bandit Algorithms with Generalized Linear Models. In Journal of Machine Learning Research: Workshop and Conference Proceedings, volume 26, 19–36. 2012.

15

Alina Beygelzimer and John Langford. The offset tree for learning with partial labels. In Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining, 129–138. 2009.

16

Adith Swaminathan and Thorsten Joachims. The Self-normalized Estimator for Counterfactual Learning. In Advances in Neural Information Processing Systems, 3231–3239. 2015.

17

Yusuke Narita, Shota Yasui, and Kohei Yata. Off-policy bandit and reinforcement learning. arXiv preprint arXiv:2002.08536, 2020.

Projects

This project is strongly inspired by Open Graph Benchmark, a collection of benchmark datasets, data loaders, and evaluators for graph machine learning: [github] [project page] [paper].
