obp.dataset.multiclass

Class for Multi-Class Classification to Bandit Reduction.

Classes

MultiClassToBanditReduction(X, y, …)

Class for handling multi-class classification data as logged bandit feedback data.

class obp.dataset.multiclass.MultiClassToBanditReduction(X: numpy.ndarray, y: numpy.ndarray, base_classifier_b: sklearn.base.ClassifierMixin, alpha_b: float = 0.8, dataset_name: Optional[str] = None)[source]

Bases: obp.dataset.base.BaseSyntheticBanditDataset

Class for handling multi-class classification data as logged bandit feedback data.

Note

A machine learning classifier such as logistic regression is used to construct behavior and evaluation policies as follows.

  1. Split the original data into training (\(\mathcal{D}_{\mathrm{tr}}\)) and evaluation (\(\mathcal{D}_{\mathrm{ev}}\)) sets.

  2. Train classifiers on \(\mathcal{D}_{\mathrm{tr}}\) and obtain base deterministic policies \(\pi_{\mathrm{det},b}\) and \(\pi_{\mathrm{det},e}\).

  3. Construct behavior (\(\pi_{b}\)) and evaluation (\(\pi_{e}\)) policies based on \(\pi_{\mathrm{det},b}\) and \(\pi_{\mathrm{det},e}\) as

    \[\pi_b (a | x) := \alpha_b \cdot \pi_{\mathrm{det},b} (a|x) + (1.0 - \alpha_b) \cdot \pi_{u} (a|x)\]
    \[\pi_e (a | x) := \alpha_e \cdot \pi_{\mathrm{det},e} (a|x) + (1.0 - \alpha_e) \cdot \pi_{u} (a|x)\]

    where \(\pi_{u}\) is a uniform random policy and \(\alpha_b\) and \(\alpha_e\) are set by the user.

  4. Measure the accuracy of the evaluation policy on \(\mathcal{D}_{\mathrm{ev}}\) with its fully observed rewards and use it as the evaluation policy's ground-truth policy value.

  5. Using \(\mathcal{D}_{\mathrm{ev}}\), an estimator \(\hat{V}\) estimates the policy value of the evaluation policy, i.e.,

    \[V(\pi_e) \approx \hat{V} (\pi_e; \mathcal{D}_{\mathrm{ev}})\]
  6. Evaluate the estimation performance of \(\hat{V}\) by comparing its estimate with the ground-truth policy value.
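
For intuition, the mixing in step 3 can be reproduced directly with NumPy and a fitted scikit-learn classifier. The function below is an illustrative sketch, not the class's internal implementation, and assumes class labels are encoded as 0, ..., n_actions - 1.

import numpy as np

def mix_with_uniform(classifier, X, n_actions, alpha):
    # one-hot matrix of the classifier's deterministic choices, i.e., pi_det(a|x)
    det = np.eye(n_actions)[classifier.predict(X).astype(int)]
    # uniform random policy pi_u(a|x) = 1 / n_actions
    uniform = np.full((X.shape[0], n_actions), 1.0 / n_actions)
    # pi(a|x) = alpha * pi_det(a|x) + (1 - alpha) * pi_u(a|x)
    return alpha * det + (1.0 - alpha) * uniform

Under this construction, a policy is deterministic only when alpha equals 1, which is why alpha_b is restricted to [0, 1) while alpha_e may be 1.0.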

Parameters
  • X (array-like, shape (n_rounds,n_features)) – Training vector of the original multi-class classification data, where n_rounds is the number of samples and n_features is the number of features.

  • y (array-like, shape (n_rounds,)) – Target vector (relative to X) of the original multi-class classification data.

  • base_classifier_b (ClassifierMixin) – Machine learning classifier used to construct a behavior policy.

  • alpha_b (float, default=0.8) – Ratio of the base deterministic policy \(\pi_{\mathrm{det},b}\) when constructing a behavior policy; the remaining \(1.0 - \alpha_b\) is assigned to the uniform random policy. Must be in the [0, 1) interval to make the behavior policy stochastic.

  • dataset_name (str, default=None) – Name of the dataset.

Examples

# evaluate the estimation performance of IPW using the `digits` data in sklearn
>>> import numpy as np
>>> from sklearn.datasets import load_digits
>>> from sklearn.linear_model import LogisticRegression
# import open bandit pipeline (obp)
>>> from obp.dataset import MultiClassToBanditReduction
>>> from obp.ope import OffPolicyEvaluation, InverseProbabilityWeighting as IPW

# load raw digits data
>>> X, y = load_digits(return_X_y=True)
# convert the raw classification data into the logged bandit dataset
>>> dataset = MultiClassToBanditReduction(
    X=X,
    y=y,
    base_classifier_b=LogisticRegression(random_state=12345),
    alpha_b=0.8,
    dataset_name="digits",
)
# split the original data into the training and evaluation sets
>>> dataset.split_train_eval(eval_size=0.7, random_state=12345)
# obtain logged bandit feedback generated by behavior policy
>>> bandit_feedback = dataset.obtain_batch_bandit_feedback(random_state=12345)
>>> bandit_feedback
{
    'n_actions': 10,
    'n_rounds': 1258,
    'context': array([[ 0.,  0.,  0., ..., 16.,  1.,  0.],
            [ 0.,  0.,  7., ..., 16.,  3.,  0.],
            [ 0.,  0., 12., ...,  8.,  0.,  0.],
            ...,
            [ 0.,  1., 13., ...,  8., 11.,  1.],
            [ 0.,  0., 15., ...,  0.,  0.,  0.],
            [ 0.,  0.,  4., ..., 15.,  3.,  0.]]),
    'action': array([6, 8, 5, ..., 2, 5, 9]),
    'reward': array([1., 1., 1., ..., 1., 1., 1.]),
    'position': array([0, 0, 0, ..., 0, 0, 0]),
    'pscore': array([0.82, 0.82, 0.82, ..., 0.82, 0.82, 0.82])
}

# obtain action choice probabilities by an evaluation policy and its ground-truth policy value
>>> action_dist = dataset.obtain_action_dist_by_eval_policy(
    base_classifier_e=LogisticRegression(C=100, random_state=12345),
    alpha_e=0.9,
)
>>> ground_truth = dataset.calc_ground_truth_policy_value(action_dist=action_dist)
>>> ground_truth
0.865643879173291

# off-policy evaluation using IPW
>>> ope = OffPolicyEvaluation(bandit_feedback=bandit_feedback, ope_estimators=[IPW()])
>>> estimated_policy_value = ope.estimate_policy_values(action_dist=action_dist)
>>> estimated_policy_value
{'ipw': 0.8662705029276045}

# evaluate the estimation performance (accuracy) of IPW by relative estimation error (relative-ee)
>>> relative_estimation_errors = ope.evaluate_performance_of_estimators(
        ground_truth_policy_value=ground_truth,
        action_dist=action_dist,
    )
>>> relative_estimation_errors
{'ipw': 0.000723881690137968}
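
As a quick sanity check on the pscore values logged above: with alpha_b=0.8 and the 10 digit classes, the behavior policy assigns

\[\pi_b (a | x) = 0.8 \cdot 1 + (1.0 - 0.8) \cdot \frac{1}{10} = 0.82\]

to rounds whose sampled action coincides with the base classifier's prediction, and \((1.0 - 0.8) \cdot \frac{1}{10} = 0.02\) otherwise, which matches the 0.82 entries shown in the pscore array.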

References

Miroslav Dudík, Dumitru Erhan, John Langford, and Lihong Li. “Doubly Robust Policy Evaluation and Optimization.”, 2014.

calc_ground_truth_policy_value(action_dist: numpy.ndarray) → numpy.ndarray[source]

Calculate the ground-truth policy value of a given action distribution.

Parameters

action_dist (array-like, shape (n_rounds_ev, n_actions, 1)) – Action distribution or action choice probabilities of the policy whose ground-truth policy value is to be calculated, where n_rounds_ev is the number of samples in the evaluation set given the current train-eval split and n_actions is the number of actions. Axis 2 of action_dist represents the length of the recommendation list; it is always 1 in the current implementation.

Returns

ground_truth_policy_value – Policy value of the given action distribution (typically that of the evaluation policy).

Return type

float
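
Conceptually, the ground-truth value is the expected accuracy of the given action distribution on the evaluation set (step 4 in the Note above). The snippet below is a sketch of that computation, not the method's internal code; y_ev is a stand-in for the evaluation-set labels encoded as 0, ..., n_actions - 1.

import numpy as np

def expected_accuracy(action_dist, y_ev):
    pi_e = action_dist[:, :, 0]                    # drop the unit list dimension
    hit_proba = pi_e[np.arange(len(y_ev)), y_ev]   # probability placed on the true class
    return float(hit_proba.mean())                 # expected accuracy = policy value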

obtain_action_dist_by_eval_policy(base_classifier_e: Optional[sklearn.base.ClassifierMixin] = None, alpha_e: float = 1.0) → numpy.ndarray[source]

Obtain action choice probabilities by an evaluation policy.

Parameters
  • base_classifier_e (ClassifierMixin, default=None) – Machine learning classifier used to construct an evaluation policy.

  • alpha_e (float, default=1.0) – Ratio of the base deterministic policy \(\pi_{\mathrm{det},e}\) when constructing an evaluation policy; the remaining \(1.0 - \alpha_e\) is assigned to the uniform random policy. Must be in the [0, 1] interval (the evaluation policy can be deterministic).

Returns

action_dist_by_eval_policy – Action choice probabilities by the evaluation policy, where n_rounds_ev is the number of samples in the evaluation set given the current train-eval split and n_actions is the number of actions. Axis 2 represents the length of the recommendation list; it is always 1 in the current implementation.

Return type

array-like, shape (n_rounds_ev, n_actions, 1)
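
For example (a usage sketch assuming the dataset object constructed and split as in the Examples above), the default alpha_e=1.0 returns the base classifier's deterministic policy, while a smaller value blends in uniform exploration:

# deterministic evaluation policy: alpha_e=1.0 keeps only pi_det,e
action_dist_det = dataset.obtain_action_dist_by_eval_policy(
    base_classifier_e=LogisticRegression(C=100, random_state=12345),
)
# stochastic evaluation policy: 0.7 * pi_det,e + 0.3 * uniform
action_dist_mixed = dataset.obtain_action_dist_by_eval_policy(
    base_classifier_e=LogisticRegression(C=100, random_state=12345),
    alpha_e=0.7,
)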

obtain_batch_bandit_feedback(random_state: Optional[int] = None) → Dict[str, Union[int, numpy.ndarray]][source]

Obtain batch logged bandit feedback generated by the behavior policy.

Note

Please call self.split_train_eval() before calling this method.

Parameters

random_state (int, default=None) – Controls the random seed in sampling actions.

Returns

bandit_feedback – bandit_feedback is logged bandit feedback data generated from a multi-class classification dataset.

Return type

BanditFeedback
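
At a high level, the reduction samples one action per evaluation-set round from the behavior policy and records a binary reward indicating whether the sampled action equals the true class. The snippet below is an illustrative sketch under that reading, not the method's internal code; pi_b and y_ev are stand-ins for the behavior-policy probabilities and the evaluation-set labels.

import numpy as np

def sample_logged_feedback(pi_b, y_ev, random_state=None):
    # pi_b: (n_rounds_ev, n_actions) behavior policy probabilities
    # y_ev: evaluation-set labels encoded as 0, ..., n_actions - 1
    rng = np.random.default_rng(random_state)
    n_rounds_ev, n_actions = pi_b.shape
    actions = np.array([rng.choice(n_actions, p=pi_b[i]) for i in range(n_rounds_ev)])
    rewards = (actions == y_ev).astype(float)        # reward is 1 if the action hits the true class
    pscores = pi_b[np.arange(n_rounds_ev), actions]  # propensity of each logged action
    return actions, rewards, pscores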

split_train_eval(eval_size: Union[int, float] = 0.25, random_state: Optional[int] = None) → None[source]

Split the original data into the training (used for policy learning) and evaluation (used for OPE) sets.

Parameters
  • eval_size (float or int, default=0.25) – If float, should be between 0.0 and 1.0 and represent the proportion of the dataset to include in the evaluation split. If int, represents the absolute number of evaluation samples.

  • random_state (int, default=None) – Controls the random seed in train-evaluation split.

property len_list

Length of recommendation lists.

property n_actions

Number of actions (number of classes).

property n_rounds

Number of samples in the original multi-class classification data.