obp.dataset.multiclass
Class for Multi-Class Classification to Bandit Reduction.
Classes
MultiClassToBanditReduction – Class for handling multi-class classification data as logged bandit feedback data.
class obp.dataset.multiclass.MultiClassToBanditReduction(X: numpy.ndarray, y: numpy.ndarray, base_classifier_b: sklearn.base.ClassifierMixin, alpha_b: float = 0.8, dataset_name: Optional[str] = None)[source]
Bases: obp.dataset.base.BaseSyntheticBanditDataset
Class for handling multi-class classification data as logged bandit feedback data.
Note
A machine learning classifier such as logistic regression is used to construct behavior and evaluation policies as follows.
1. Split the original data into training (\(\mathcal{D}_{\mathrm{tr}}\)) and evaluation (\(\mathcal{D}_{\mathrm{ev}}\)) sets.
2. Train classifiers on \(\mathcal{D}_{\mathrm{tr}}\) and obtain base deterministic policies \(\pi_{\mathrm{det},b}\) and \(\pi_{\mathrm{det},e}\).
3. Construct behavior (\(\pi_{b}\)) and evaluation (\(\pi_{e}\)) policies based on \(\pi_{\mathrm{det},b}\) and \(\pi_{\mathrm{det},e}\) as
\[\pi_b (a|x) := \alpha_b \cdot \pi_{\mathrm{det},b} (a|x) + (1.0 - \alpha_b) \cdot \pi_{u} (a|x)\]
\[\pi_e (a|x) := \alpha_e \cdot \pi_{\mathrm{det},e} (a|x) + (1.0 - \alpha_e) \cdot \pi_{u} (a|x)\]
where \(\pi_{u}\) is a uniform random policy and \(\alpha_b\) and \(\alpha_e\) are set by the user (a code sketch of this mixing appears right after this note).
4. Measure the accuracy of the evaluation policy on \(\mathcal{D}_{\mathrm{ev}}\) with its fully observed rewards and use it as the evaluation policy's ground-truth policy value.
5. Using \(\mathcal{D}_{\mathrm{ev}}\), an estimator \(\hat{V}\) estimates the policy value of the evaluation policy, i.e.,
\[V(\pi_e) \approx \hat{V} (\pi_e; \mathcal{D}_{\mathrm{ev}})\]
6. Evaluate the estimation performance of \(\hat{V}\) by comparing its estimate with the ground-truth policy value.
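As a rough illustration of step 3 (a sketch only, not obp's internal implementation), the mixed policy can be built from any fitted scikit-learn classifier as below; make_stochastic_policy and its arguments are illustrative names, and class labels are assumed to be integers 0, ..., n_actions - 1.

import numpy as np

def make_stochastic_policy(classifier, X, n_actions, alpha):
    """Mix a deterministic classifier-based policy with a uniform random policy:
    pi(a|x) = alpha * 1{a == classifier(x)} + (1 - alpha) / n_actions."""
    preds = classifier.predict(X)  # deterministic action for each context
    pi_det = np.zeros((X.shape[0], n_actions))
    pi_det[np.arange(X.shape[0]), preds] = 1.0  # one-hot over predicted classes
    pi_uniform = np.full((X.shape[0], n_actions), 1.0 / n_actions)
    return alpha * pi_det + (1.0 - alpha) * pi_uniform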
- Parameters
X (array-like, shape (n_rounds,n_features)) – Training vector of the original multi-class classification data, where n_rounds is the number of samples and n_features is the number of features.
y (array-like, shape (n_rounds,)) – Target vector (relative to X) of the original multi-class classification data.
base_classifier_b (ClassifierMixin) – Machine learning classifier used to construct a behavior policy.
alpha_b (float, default=0.8) – Weight given to the base deterministic policy \(\pi_{\mathrm{det},b}\) when constructing a behavior policy; the remaining \(1 - \alpha_b\) weight goes to a uniform random policy. Must be in the [0, 1) interval so that the behavior policy is stochastic.
dataset_name (str, default=None) – Name of the dataset.
Examples
# evaluate the estimation performance of IPW using the `digits` data in sklearn
>>> import numpy as np
>>> from sklearn.datasets import load_digits
>>> from sklearn.linear_model import LogisticRegression

# import open bandit pipeline (obp)
>>> from obp.dataset import MultiClassToBanditReduction
>>> from obp.ope import OffPolicyEvaluation, InverseProbabilityWeighting as IPW

# load raw digits data
>>> X, y = load_digits(return_X_y=True)

# convert the raw classification data into the logged bandit dataset
>>> dataset = MultiClassToBanditReduction(
...     X=X,
...     y=y,
...     base_classifier_b=LogisticRegression(random_state=12345),
...     alpha_b=0.8,
...     dataset_name="digits",
... )

# split the original data into the training and evaluation sets
>>> dataset.split_train_eval(eval_size=0.7, random_state=12345)

# obtain logged bandit feedback generated by behavior policy
>>> bandit_feedback = dataset.obtain_batch_bandit_feedback(random_state=12345)
>>> bandit_feedback
{
    'n_actions': 10,
    'n_rounds': 1258,
    'context': array([[ 0.,  0.,  0., ..., 16.,  1.,  0.],
        [ 0.,  0.,  7., ..., 16.,  3.,  0.],
        [ 0.,  0., 12., ...,  8.,  0.,  0.],
        ...,
        [ 0.,  1., 13., ...,  8., 11.,  1.],
        [ 0.,  0., 15., ...,  0.,  0.,  0.],
        [ 0.,  0.,  4., ..., 15.,  3.,  0.]]),
    'action': array([6, 8, 5, ..., 2, 5, 9]),
    'reward': array([1., 1., 1., ..., 1., 1., 1.]),
    'position': array([0, 0, 0, ..., 0, 0, 0]),
    'pscore': array([0.82, 0.82, 0.82, ..., 0.82, 0.82, 0.82])
}

# obtain action choice probabilities by an evaluation policy and its ground-truth policy value
>>> action_dist = dataset.obtain_action_dist_by_eval_policy(
...     base_classifier_e=LogisticRegression(C=100, random_state=12345),
...     alpha_e=0.9,
... )
>>> ground_truth = dataset.calc_ground_truth_policy_value(action_dist=action_dist)
>>> ground_truth
0.865643879173291

# off-policy evaluation using IPW
>>> ope = OffPolicyEvaluation(bandit_feedback=bandit_feedback, ope_estimators=[IPW()])
>>> estimated_policy_value = ope.estimate_policy_values(action_dist=action_dist)
>>> estimated_policy_value
{'ipw': 0.8662705029276045}

# evaluate the estimation performance (accuracy) of IPW by relative estimation error (relative-ee)
>>> relative_estimation_errors = ope.evaluate_performance_of_estimators(
...     ground_truth_policy_value=ground_truth,
...     action_dist=action_dist,
... )
>>> relative_estimation_errors
{'ipw': 0.000723881690137968}
References
Miroslav Dudík, Dumitru Erhan, John Langford, and Lihong Li. “Doubly Robust Policy Evaluation and Optimization.”, 2014.
calc_ground_truth_policy_value(action_dist: numpy.ndarray) → numpy.ndarray[source]
Calculate the ground-truth policy value of a given action distribution.
- Parameters
action_dist (array-like, shape (n_rounds_ev, n_actions, 1)) – Action distribution or action choice probabilities of the policy whose ground-truth policy value is to be calculated, where n_rounds_ev is the number of samples in the evaluation set given the current train-eval split and n_actions is the number of actions. Axis 2 of action_dist represents the length of the recommendation list; it is always 1 in the current implementation.
- Returns
ground_truth_policy_value – Ground-truth policy value of the given action distribution (typically that of an evaluation policy).
- Return type
float
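Conceptually, because each sample's reward is 1 only when the chosen action equals its true class label, this ground-truth value is the (soft) accuracy of the policy on the evaluation set. A minimal sketch of the computed quantity follows; ground_truth_value_sketch and y_ev are illustrative names, not obp attributes.

import numpy as np

def ground_truth_value_sketch(action_dist, y_ev):
    # probability the policy assigns to the correct class of each evaluation sample
    correct_class_prob = action_dist[np.arange(y_ev.shape[0]), y_ev, 0]
    # expected reward of the policy = mean probability of choosing the correct class
    return correct_class_prob.mean()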
obtain_action_dist_by_eval_policy(base_classifier_e: Optional[sklearn.base.ClassifierMixin] = None, alpha_e: float = 1.0) → numpy.ndarray[source]
Obtain action choice probabilities by an evaluation policy.
- Parameters
base_classifier_e (ClassifierMixin, default=None) – Machine learning classifier used to construct an evaluation policy.
alpha_e (float, default=1.0) – Weight given to the base deterministic policy \(\pi_{\mathrm{det},e}\) when constructing an evaluation policy; the remaining \(1 - \alpha_e\) weight goes to a uniform random policy. Must be in the [0, 1] interval (the evaluation policy can be deterministic).
- Returns
action_dist_by_eval_policy – Action choice probabilities of the evaluation policy, where n_rounds_ev is the number of samples in the evaluation set given the current train-eval split and n_actions is the number of actions. Axis 2 represents the length of the recommendation list; it is always 1 in the current implementation.
- Return type
array-like, shape (n_rounds_ev, n_actions, 1)
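Continuing the doctest session from the Examples above, setting alpha_e=1.0 (the default) yields a deterministic evaluation policy: each (context, action) slice of the returned array is a one-hot vector over the classifier's predicted class.

>>> action_dist_det = dataset.obtain_action_dist_by_eval_policy(
...     base_classifier_e=LogisticRegression(C=100, random_state=12345),
...     alpha_e=1.0,
... )
# action_dist_det has shape (n_rounds_ev, n_actions, 1), and each row sums to 1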
obtain_batch_bandit_feedback(random_state: Optional[int] = None) → Dict[str, Union[int, numpy.ndarray]][source]
Obtain batch logged bandit feedback generated by the behavior policy.
Note
Please call self.split_train_eval() before calling this method.
- Parameters
random_state (int, default=None) – Controls the random seed in sampling actions.
- Returns
bandit_feedback – Logged bandit feedback data generated from the multi-class classification dataset.
- Return type
BanditFeedback
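As a quick consistency check against the Examples above (using the mixing formula from the Note, not an obp API), the propensity score of the action recommended by the deterministic base policy follows directly from alpha_b and the number of classes:

# pi_b(a_det | x) = alpha_b + (1 - alpha_b) / n_actions
alpha_b, n_actions = 0.8, 10  # settings used in the Examples (digits has 10 classes)
pscore_det_action = alpha_b + (1.0 - alpha_b) / n_actions
print(pscore_det_action)  # 0.82, matching the 'pscore' entries shown above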
split_train_eval(eval_size: Union[int, float] = 0.25, random_state: Optional[int] = None) → None[source]
Split the original data into the training set (used for policy learning) and the evaluation set (used for OPE).
- Parameters
eval_size (float or int, default=0.25) – If float, should be between 0.0 and 1.0 and represent the proportion of the dataset to include in the evaluation split. If int, represents the absolute number of evaluation samples.
random_state (int, default=None) – Controls the random seed in train-evaluation split.
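For illustration, eval_size accepts either a proportion or an absolute count; both calls below assume the dataset object constructed in the Examples above.

>>> dataset.split_train_eval(eval_size=0.3, random_state=12345)  # hold out 30% of samples for OPE
>>> dataset.split_train_eval(eval_size=500, random_state=12345)  # hold out exactly 500 samples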
property len_list
Length of recommendation lists.

property n_actions
Number of actions (number of classes).

property n_rounds
Number of samples in the original multi-class classification data.