obp.dataset.synthetic

Class for Generating Synthetic Logged Bandit Feedback.

Functions

linear_behavior_policy(context, action_context)

Linear contextual behavior policy for synthetic bandit datasets.

linear_reward_function(context, action_context)

Linear mean reward function for synthetic bandit datasets.

logistic_reward_function(context, action_context)

Logistic mean reward function for synthetic bandit datasets.

Classes

SyntheticBanditDataset(n_actions, …)

Class for generating synthetic bandit dataset.

class obp.dataset.synthetic.SyntheticBanditDataset(n_actions: int, dim_context: int = 1, reward_type: str = 'binary', reward_function: Optional[Callable[[numpy.ndarray, numpy.ndarray], numpy.ndarray]] = None, behavior_policy_function: Optional[Callable[[numpy.ndarray, numpy.ndarray], numpy.ndarray]] = None, random_state: Optional[int] = None, dataset_name: str = 'synthetic_bandit_dataset')[source]

Bases: obp.dataset.base.BaseSyntheticBanditDataset

Class for generating synthetic bandit dataset.

Note

By calling the obtain_batch_bandit_feedback method several times, we can obtain different bandit feedback samples generated under the same setting. This can be used to estimate confidence intervals of the performance of OPE estimators.

If None is set as behavior_policy_function, the generated synthetic data will be context-free bandit feedback; in that case, a uniform random behavior policy is used.
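
A minimal sketch of the repeated-sampling idea (reusing the dataset instance and the numpy import from the Examples section below; the names batch_a and batch_b are illustrative):

>>> batch_a = dataset.obtain_batch_bandit_feedback(n_rounds=1000)
>>> batch_b = dataset.obtain_batch_bandit_feedback(n_rounds=1000)
>>> # consecutive calls draw fresh samples, so the logged data (almost surely) differ
>>> np.array_equal(batch_a["reward"], batch_b["reward"])
False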

Parameters
  • n_actions (int) – Number of actions.

  • dim_context (int, default=1) – Number of dimensions of context vectors.

  • reward_type (str, default='binary') – Type of reward variable, must be either 'binary' or 'continuous'. When 'binary' is given, rewards are sampled from the Bernoulli distribution. When 'continuous' is given, rewards are sampled from the truncated Normal distribution with scale=1.

  • reward_function (Callable[[np.ndarray, np.ndarray], np.ndarray], default=None) – Function generating the expected reward from context and action context vectors, i.e., \(\mu: \mathcal{X} \times \mathcal{A} \rightarrow \mathbb{R}\). If None is set, a context-independent expected reward for each action is sampled from the uniform distribution automatically.

  • behavior_policy_function (Callable[[np.ndarray, np.ndarray], np.ndarray], default=None) – Function generating a probability distribution over the action space, i.e., \(\pi: \mathcal{X} \rightarrow \Delta(\mathcal{A})\). If None is set, a context-independent uniform distribution is used (i.e., a uniform random behavior policy).

  • random_state (int, default=None) – Controls the random seed in sampling the synthetic bandit dataset.

  • dataset_name (str, default='synthetic_bandit_dataset') – Name of the dataset.

Examples

>>> import numpy as np
>>> from obp.dataset import (
    SyntheticBanditDataset,
    logistic_reward_function,
    linear_behavior_policy
)

# generate synthetic contextual bandit feedback with 10 actions.
>>> dataset = SyntheticBanditDataset(
        n_actions=10,
        dim_context=5,
        reward_function=logistic_reward_function,
        behavior_policy_function=linear_behavior_policy,
        random_state=12345
    )
>>> bandit_feedback = dataset.obtain_batch_bandit_feedback(n_rounds=100000)
>>> bandit_feedback
{
    'n_rounds': 100000,
    'n_actions': 10,
    'context': array([[-0.20470766,  0.47894334, -0.51943872, -0.5557303 ,  1.96578057],
            [ 1.39340583,  0.09290788,  0.28174615,  0.76902257,  1.24643474],
            [ 1.00718936, -1.29622111,  0.27499163,  0.22891288,  1.35291684],
            ...,
            [ 1.36946256,  0.58727761, -0.69296769, -0.27519988, -2.10289159],
            [-0.27428715,  0.52635353,  1.02572168, -0.18486381,  0.72464834],
            [-1.25579833, -1.42455203, -0.26361242,  0.27928604,  1.21015571]]),
    'action_context': array([[1, 0, 0, 0, 0, 0, 0, 0, 0, 0],
            [0, 1, 0, 0, 0, 0, 0, 0, 0, 0],
            [0, 0, 1, 0, 0, 0, 0, 0, 0, 0],
            [0, 0, 0, 1, 0, 0, 0, 0, 0, 0],
            [0, 0, 0, 0, 1, 0, 0, 0, 0, 0],
            [0, 0, 0, 0, 0, 1, 0, 0, 0, 0],
            [0, 0, 0, 0, 0, 0, 1, 0, 0, 0],
            [0, 0, 0, 0, 0, 0, 0, 1, 0, 0],
            [0, 0, 0, 0, 0, 0, 0, 0, 1, 0],
            [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]]),
    'action': array([7, 4, 0, ..., 7, 9, 6]),
    'position': array([0, 0, 0, ..., 0, 0, 0]),
    'reward': array([0, 1, 1, ..., 0, 1, 0]),
    'expected_reward': array([[0.80210203, 0.73828559, 0.83199558, ..., 0.81190503, 0.70617705,
            0.68985306],
            [0.94119582, 0.93473317, 0.91345213, ..., 0.94140688, 0.93152449,
            0.90132868],
            [0.87248862, 0.67974991, 0.66965669, ..., 0.79229752, 0.82712978,
            0.74923536],
            ...,
            [0.64856003, 0.38145901, 0.84476094, ..., 0.40962057, 0.77114661,
            0.65752798],
            [0.73208527, 0.82012699, 0.78161352, ..., 0.72361416, 0.8652249 ,
            0.82571751],
            [0.40348366, 0.24485417, 0.24037926, ..., 0.49613133, 0.30714854,
            0.5527749 ]]),
    'pscore': array([0.05423855, 0.10339675, 0.09756788, ..., 0.05423855, 0.07250876,
            0.14065505])
}
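
As a further illustration (a minimal sketch, not part of the library's documented examples), omitting both reward_function and behavior_policy_function and setting reward_type='continuous' yields context-free feedback with continuous rewards:

>>> contextfree_dataset = SyntheticBanditDataset(
        n_actions=10,
        reward_type='continuous',
        random_state=12345
    )
>>> contextfree_feedback = contextfree_dataset.obtain_batch_bandit_feedback(n_rounds=10000)
>>> contextfree_feedback['reward'].shape  # one reward per round
(10000,)
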
obtain_batch_bandit_feedback(n_rounds: int) → Dict[str, Union[int, numpy.ndarray]][source]

Obtain batch logged bandit feedback.

Parameters

n_rounds (int) – Number of rounds for synthetic bandit feedback data.

Returns

bandit_feedback – Generated synthetic bandit feedback dataset.

Return type

BanditFeedback
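
A minimal sketch (continuing the session in the Examples above) of inspecting the returned dictionary; per-round fields share the leading n_rounds dimension:

>>> bandit_feedback.keys()
dict_keys(['n_rounds', 'n_actions', 'context', 'action_context', 'action', 'position', 'reward', 'expected_reward', 'pscore'])
>>> bandit_feedback['context'].shape, bandit_feedback['action'].shape, bandit_feedback['pscore'].shape
((100000, 5), (100000,), (100000,))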

sample_contextfree_expected_reward() → numpy.ndarray[source]

Sample expected reward for each action from the uniform distribution.
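
A minimal sketch (assuming a dataset instance such as contextfree_dataset above): the method draws one value per action, presumably from the standard uniform distribution on [0, 1).

>>> contextfree_dataset.sample_contextfree_expected_reward().shape
(10,)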

property len_list

Length of recommendation lists.

obp.dataset.synthetic.linear_behavior_policy(context: numpy.ndarray, action_context: numpy.ndarray, random_state: Optional[int] = None) → numpy.ndarray[source]

Linear contextual behavior policy for synthetic bandit datasets.

Parameters
  • context (array-like, shape (n_rounds, dim_context)) – Context vectors characterizing each round (such as user information).

  • action_context (array-like, shape (n_actions, dim_action_context)) – Vector representation for each action.

  • random_state (int, default=None) – Controls the random seed in sampling the dataset.

Returns

behavior_policy – Action choice probabilities given context (\(x\)), i.e., \(\pi: \mathcal{X} \rightarrow \Delta(\mathcal{A})\).

Return type

array-like, shape (n_rounds, n_actions)
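
A minimal, self-contained sketch of calling this function directly (the array shapes are illustrative):

>>> import numpy as np
>>> from obp.dataset import linear_behavior_policy
>>> context = np.random.normal(size=(5, 3))   # n_rounds=5, dim_context=3
>>> action_context = np.eye(4)                # n_actions=4, one-hot action representation
>>> pi_b = linear_behavior_policy(context, action_context, random_state=12345)
>>> pi_b.shape
(5, 4)
>>> np.allclose(pi_b.sum(axis=1), 1.0)        # each row is a distribution over the 4 actions
True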

obp.dataset.synthetic.linear_reward_function(context: numpy.ndarray, action_context: numpy.ndarray, random_state: Optional[int] = None) → numpy.ndarray[source]

Linear mean reward function for synthetic bandit datasets.

Parameters
  • context (array-like, shape (n_rounds, dim_context)) – Context vectors characterizing each round (such as user information).

  • action_context (array-like, shape (n_actions, dim_action_context)) – Vector representation for each action.

  • random_state (int, default=None) – Controls the random seed in sampling the dataset.

Returns

expected_reward – Expected reward given context (\(x\)) and action (\(a\)), i.e., \(q(x,a):=\mathbb{E}[r|x,a]\).

Return type

array-like, shape (n_rounds, n_actions)

obp.dataset.synthetic.logistic_reward_function(context: numpy.ndarray, action_context: numpy.ndarray, random_state: Optional[int] = None) → numpy.ndarray[source]

Logistic mean reward function for synthetic bandit datasets.

Parameters
  • context (array-like, shape (n_rounds, dim_context)) – Context vectors characterizing each round (such as user information).

  • action_context (array-like, shape (n_actions, dim_action_context)) – Vector representation for each action.

  • random_state (int, default=None) – Controls the random seed in sampling the dataset.

Returns

expected_reward – Expected reward given context (\(x\)) and action (\(a\)), i.e., \(q(x,a):=\mathbb{E}[r|x,a]\).

Return type

array-like, shape (n_rounds, n_actions)
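
A minimal, self-contained sketch covering both reward functions above (shapes are illustrative). logistic_reward_function presumably passes linear scores through a sigmoid, so its outputs lie in [0, 1], consistent with the expected_reward values in the example output above:

>>> import numpy as np
>>> from obp.dataset import linear_reward_function, logistic_reward_function
>>> context = np.random.normal(size=(5, 3))   # n_rounds=5, dim_context=3
>>> action_context = np.eye(4)                # n_actions=4
>>> q_linear = linear_reward_function(context, action_context, random_state=12345)
>>> q_logistic = logistic_reward_function(context, action_context, random_state=12345)
>>> q_linear.shape, q_logistic.shape
((5, 4), (5, 4))
>>> bool(np.all((0 <= q_logistic) & (q_logistic <= 1)))   # valid Bernoulli means
True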