obp.policy.contextfree

Context-Free Bandit Algorithms.

Classes

  • BernoulliTS(n_actions, len_list, batch_size, …) – Bernoulli Thompson Sampling policy.

  • EpsilonGreedy(n_actions, len_list, …) – Epsilon Greedy policy.

  • Random(n_actions, len_list, batch_size, …) – Random policy.

class obp.policy.contextfree.BernoulliTS(n_actions: int, len_list: int = 1, batch_size: int = 1, random_state: Optional[int] = None, alpha: Optional[numpy.ndarray] = None, beta: Optional[numpy.ndarray] = None, is_zozotown_prior: bool = False, campaign: Optional[str] = None, policy_name: str = 'bts')[source]

Bases: obp.policy.base.BaseContextFreePolicy

Bernoulli Thompson Sampling policy.

Parameters
  • n_actions (int) – Number of actions.

  • len_list (int, default=1) – Length of the list of actions recommended in each impression. When the Open Bandit Dataset is used, this should be set to 3.

  • batch_size (int, default=1) – Number of samples used in a batch parameter update.

  • random_state (int, default=None) – Controls the random seed in sampling actions.

  • alpha (array-like, shape (n_actions, ), default=None) – Prior alpha parameters of the per-action Beta distributions.

  • beta (array-like, shape (n_actions, ), default=None) – Prior beta parameters of the per-action Beta distributions.

  • is_zozotown_prior (bool, default=False) – Whether to use the Beta prior hyperparameters that were used at the start of the data collection period on ZOZOTOWN.

  • campaign (str, default=None) – One of the three campaigns considered in ZOZOTOWN: “all”, “men”, or “women”.

  • policy_name (str, default=’bts’) – Name of bandit policy.
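
A minimal instantiation sketch (assuming obp is installed); the ten actions and the uniform Beta(1, 1) priors are illustrative values, not requirements:

    import numpy as np
    from obp.policy.contextfree import BernoulliTS

    # Bernoulli TS over 10 actions; 3 slots per impression as in the Open Bandit Dataset.
    bts = BernoulliTS(
        n_actions=10,
        len_list=3,
        batch_size=1,
        random_state=12345,
        alpha=np.ones(10),  # prior alpha of each action's Beta distribution (illustrative)
        beta=np.ones(10),   # prior beta of each action's Beta distribution (illustrative)
    )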

compute_batch_action_dist(n_rounds: int = 1, n_sim: int = 100000) → numpy.ndarray[source]

Compute the distribution over actions by Monte Carlo simulation.

Parameters
  • n_rounds (int, default=1) – Number of rounds in the distribution over actions (the size of the first axis of action_dist).

  • n_sim (int, default=100000) – Number of simulations in the Monte Carlo simulation to compute the distribution over actions.

Returns

action_dist – Probability estimates of each arm being the best one for each sample, action, and position.

Return type

array-like, shape (n_rounds, n_actions, len_list)
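
For instance, continuing from the sketch above, the action distribution can be approximated as follows (shapes follow the documented return type):

    # `bts` is the BernoulliTS instance from the earlier sketch.
    action_dist = bts.compute_batch_action_dist(n_rounds=1000, n_sim=100000)
    print(action_dist.shape)  # (1000, 10, 3) = (n_rounds, n_actions, len_list)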

initialize() → None

Initialize Parameters.

select_action() → numpy.ndarray[source]

Select a list of actions.

Returns

selected_actions – List of selected actions.

Return type

array-like, shape (len_list, )

update_params(action: int, reward: float) → None[source]

Update policy parameters.

Parameters
  • action (int) – Selected action by the policy.

  • reward (float) – Observed reward for the chosen action and position.
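
A sketch of the interactive loop implied by select_action and update_params; the Bernoulli reward simulator below is a hypothetical stand-in for logged feedback and is not part of obp:

    import numpy as np

    rng = np.random.default_rng(0)
    true_ctr = rng.uniform(0.01, 0.10, size=10)  # hypothetical ground-truth click rates

    for _ in range(1000):
        selected_actions = bts.select_action()  # array of shape (len_list, )
        top_action = int(selected_actions[0])   # use feedback for the top slot only (a simplification)
        reward = float(rng.binomial(n=1, p=true_ctr[top_action]))
        bts.update_params(action=top_action, reward=reward)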

property policy_type

Type of the bandit policy.

class obp.policy.contextfree.EpsilonGreedy(n_actions: int, len_list: int = 1, batch_size: int = 1, random_state: Optional[int] = None, epsilon: float = 1.0)[source]

Bases: obp.policy.base.BaseContextFreePolicy

Epsilon Greedy policy.

Parameters
  • n_actions (int) – Number of actions.

  • len_list (int, default=1) – Length of the list of actions recommended in each impression. When the Open Bandit Dataset is used, this should be set to 3.

  • batch_size (int, default=1) – Number of samples used in a batch parameter update.

  • random_state (int, default=None) – Controls the random seed in sampling actions.

  • epsilon (float, default=1.) – Exploration hyperparameter; must take a value in the range [0., 1.].

  • policy_name (str, default=f’egreedy_{epsilon}’) – Name of bandit policy.
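
A minimal instantiation sketch; epsilon=0.1 is an illustrative choice (the default of 1.0 makes the policy explore uniformly at random):

    from obp.policy.contextfree import EpsilonGreedy

    egreedy = EpsilonGreedy(
        n_actions=10,
        len_list=3,
        batch_size=1,
        random_state=12345,
        epsilon=0.1,  # explore with probability 0.1, otherwise exploit the empirically best arms
    )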

initialize() → None

Initialize Parameters.

select_action() → numpy.ndarray[source]

Select a list of actions.

Returns

selected_actions – List of selected actions.

Return type

array-like, shape (len_list, )

update_params(action: int, reward: float) → None[source]

Update policy parameters.

Parameters
  • action (int) – Selected action by the policy.

  • reward (float) – Observed reward for the chosen action and position.
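
As with BernoulliTS, the policy alternates select_action and update_params; the sketch below uses a hypothetical reward simulator and a larger batch_size to illustrate batched parameter updates (batch_size controls how many samples feed one update):

    import numpy as np
    from obp.policy.contextfree import EpsilonGreedy

    rng = np.random.default_rng(1)
    true_ctr = rng.uniform(0.01, 0.10, size=10)  # hypothetical ground-truth click rates

    egreedy_batched = EpsilonGreedy(
        n_actions=10, len_list=3, batch_size=10, epsilon=0.1, random_state=0
    )
    for _ in range(500):
        action = int(egreedy_batched.select_action()[0])
        reward = float(rng.binomial(n=1, p=true_ctr[action]))
        egreedy_batched.update_params(action=action, reward=reward)  # one update per batch_size samples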

property policy_type

Type of the bandit policy.

class obp.policy.contextfree.Random(n_actions: int, len_list: int = 1, batch_size: int = 1, random_state: Optional[int] = None, epsilon: float = 1.0, policy_name: str = 'random')[source]

Bases: obp.policy.contextfree.EpsilonGreedy

Random policy.

Parameters
  • n_actions (int) – Number of actions.

  • len_list (int, default=1) – Length of the list of actions recommended in each impression. When the Open Bandit Dataset is used, this should be set to 3.

  • batch_size (int, default=1) – Number of samples used in a batch parameter update.

  • random_state (int, default=None) – Controls the random seed in sampling actions.

  • epsilon (float, default=1.) – Exploration hyperparameter; must take a value in the range [0., 1.].

  • policy_name (str, default=’random’) – Name of bandit policy.
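
A minimal instantiation sketch; since Random subclasses EpsilonGreedy with epsilon defaulting to 1.0, select_action draws the list of actions uniformly at random:

    from obp.policy.contextfree import Random

    random_policy = Random(
        n_actions=10,
        len_list=3,
        random_state=12345,
    )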

compute_batch_action_dist(n_rounds: int = 1, n_sim: int = 100000) → numpy.ndarray[source]

Compute the distribution over actions by Monte Carlo simulation.

Parameters
  • n_rounds (int, default=1) – Number of rounds in the distribution over actions (the size of the first axis of action_dist).

Returns

action_dist – Probability estimates of each arm being the best one for each sample, action, and position.

Return type

array-like, shape (n_rounds, n_actions, len_list)
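
A sketch of the call; for a uniformly random policy each probability estimate is expected to be about 1 / n_actions (0.1 for ten actions), though the exact values depend on the implementation:

    action_dist = random_policy.compute_batch_action_dist(n_rounds=100)
    print(action_dist.shape)     # (100, 10, 3) = (n_rounds, n_actions, len_list)
    print(action_dist[0, :, 0])  # roughly uniform probabilities over the 10 actions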

initialize() → None

Initialize Parameters.

select_action() → numpy.ndarray

Select a list of actions.

Returns

selected_actions – List of selected actions.

Return type

array-like, shape (len_list, )

update_params(action: int, reward: float) → None

Update policy parameters.

Parameters
  • action (int) – Selected action by the policy.

  • reward (float) – Observed reward for the chosen action and position.

property policy_type

Type of the bandit policy.