obp.policy.linear

Contextual Linear Bandit Algorithms.

Classes

LinEpsilonGreedy(dim, n_actions, len_list, …)

Linear Epsilon Greedy.

LinTS(dim, n_actions, len_list, batch_size, …)

Linear Thompson Sampling.

LinUCB(dim, n_actions, len_list, batch_size, …)

Linear Upper Confidence Bound.

class obp.policy.linear.LinEpsilonGreedy(dim: int, n_actions: int, len_list: int = 1, batch_size: int = 1, alpha_: float = 1.0, lambda_: float = 1.0, random_state: Optional[int] = None, epsilon: float = 0.0)[source]

Bases: obp.policy.base.BaseContextualPolicy

Linear Epsilon Greedy.

Parameters
  • dim (int) – Number of dimensions of context vectors.

  • n_actions (int) – Number of actions.

  • len_list (int, default=1) – Length of the list of actions recommended in each impression. When Open Bandit Dataset is used, this should be set to 3.

  • batch_size (int, default=1) – Number of samples used in a batch parameter update.

  • n_trial (int, default=0) – Current number of trials in a bandit simulation.

  • random_state (int, default=None) – Controls the random seed in sampling actions.

  • epsilon (float, default=0.) – Exploration hyperparameter that must take a value in the range [0., 1.].

References

L. Li, W. Chu, J. Langford, and E. Schapire. A contextual-bandit approach to personalized news article recommendation. In Proceedings of the 19th International Conference on World Wide Web, pp. 661–670. ACM, 2010.
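
Example

A minimal usage sketch with synthetic contexts and rewards; the dimensionality, reward model, and interaction loop below are illustrative assumptions, not part of obp.

    import numpy as np
    from obp.policy.linear import LinEpsilonGreedy

    dim, n_actions = 5, 3
    policy = LinEpsilonGreedy(dim=dim, n_actions=n_actions, epsilon=0.1, random_state=12345)

    rng = np.random.default_rng(12345)
    for _ in range(100):
        context = rng.normal(size=(1, dim))        # one observed context vector
        action = policy.select_action(context)[0]  # len_list=1, so take the single selected action
        reward = float(rng.binomial(1, 0.5))       # synthetic reward, for illustration only
        policy.update_params(action=action, reward=reward, context=context)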

initialize() → None

Initialize policy parameters.

select_action(context: numpy.ndarray) → numpy.ndarray[source]

Select action for new data.

Parameters

context (array-like, shape (1, dim_context)) – Observed context vector.

Returns

selected_actions – List of selected actions.

Return type

array-like, shape (len_list, )
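
The selection rule is the standard epsilon-greedy scheme over a per-action linear reward model. The sketch below is an illustrative, self-contained version of that rule, not obp's internal implementation; theta_hat (per-action coefficient estimates) and the helper name are assumptions.

    import numpy as np

    def epsilon_greedy_select(theta_hat, context, epsilon, len_list, rng):
        """Explore uniformly with probability `epsilon`; otherwise rank actions
        by the estimated linear reward theta_hat[a] @ x and return the top ones."""
        n_actions = theta_hat.shape[0]
        if rng.random() < epsilon:
            return rng.choice(n_actions, size=len_list, replace=False)
        scores = theta_hat @ context.flatten()      # shape: (n_actions,)
        return np.argsort(scores)[::-1][:len_list]  # top `len_list` actions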

update_params(action: int, reward: float, context: numpy.ndarray) → None[source]

Update policy parameters.

Parameters
  • action (int) – Selected action by the policy.

  • reward (float) – Observed reward for the chosen action and position.

  • context (array-like, shape (1, dim_context)) – Observed context vector.
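
For reference, a typical linear-bandit parameter update keeps, per action, a Gram matrix A_a = lambda_ * I + sum of x x^T and a vector b_a = sum of reward * x, from which theta_hat_a = A_a^{-1} b_a. The sketch below shows that update using a Sherman-Morrison rank-one step; the variable names are assumptions and this is not necessarily obp's exact code.

    import numpy as np

    def linear_update(A_inv, b, action, reward, context):
        """Rank-one update for the chosen action: A_a += x x^T (maintained as its
        inverse via Sherman-Morrison) and b_a += reward * x."""
        x = context.flatten()
        Ax = A_inv[action] @ x
        A_inv[action] -= np.outer(Ax, Ax) / (1.0 + x @ Ax)
        b[action] += reward * x
        theta_hat_a = A_inv[action] @ b[action]     # refreshed estimate for this action
        return theta_hat_a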

property policy_type

Type of the bandit policy.

class obp.policy.linear.LinTS(dim: int, n_actions: int, len_list: int = 1, batch_size: int = 1, alpha_: float = 1.0, lambda_: float = 1.0, random_state: Optional[int] = None)[source]

Bases: obp.policy.base.BaseContextualPolicy

Linear Thompson Sampling.

Parameters
  • dim (int) – Number of dimensions of context vectors.

  • n_actions (int) – Number of actions.

  • len_list (int, default=1) – Length of the list of actions recommended in each impression. When Open Bandit Dataset is used, this should be set to 3.

  • batch_size (int, default=1) – Number of samples used in a batch parameter update.

  • alpha_ (float, default=1.) – Prior parameter for the Bayesian linear regression model used by the policy.

  • random_state (int, default=None) – Controls the random seed in sampling actions.
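
Example

A minimal instantiation sketch; the hyperparameter values below are illustrative assumptions.

    from obp.policy.linear import LinTS

    # recommendation list of length 3 (e.g., for Open Bandit Dataset) with
    # parameters updated in batches of 100 observed samples
    lin_ts = LinTS(dim=5, n_actions=80, len_list=3, batch_size=100, random_state=12345)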

initialize() → None

Initialize policy parameters.

select_action(context: numpy.ndarray) → numpy.ndarray[source]

Select action for new data.

Parameters

context (array-like, shape (1, dim_context)) – Observed context vector.

Returns

selected_actions – List of selected actions.

Return type

array-like, shape (len_list, )
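
Conceptually, Thompson sampling draws a coefficient vector for each action from its posterior and ranks actions by the sampled rewards. The sketch below is an illustrative version under a Gaussian posterior; theta_hat, A_inv, and the helper name are assumptions, not obp's internal API.

    import numpy as np

    def thompson_select(theta_hat, A_inv, context, len_list, alpha_, rng):
        """Sample theta_a ~ N(theta_hat_a, alpha_ * A_inv_a) for every action and
        return the `len_list` actions with the highest sampled rewards theta_a @ x."""
        x = context.flatten()
        sampled_rewards = np.array([
            rng.multivariate_normal(theta_hat[a], alpha_ * A_inv[a]) @ x
            for a in range(theta_hat.shape[0])
        ])
        return np.argsort(sampled_rewards)[::-1][:len_list]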

update_params(action: int, reward: float, context: numpy.ndarray) → None[source]

Update policy parameters.

Parameters
  • action (int) – Selected action by the policy.

  • reward (float) – Observed reward for the chosen action and position.

  • context (array-like, shape (1, dim_context)) – Observed context vector.

property policy_type

Type of the bandit policy.

class obp.policy.linear.LinUCB(dim: int, n_actions: int, len_list: int = 1, batch_size: int = 1, alpha_: float = 1.0, lambda_: float = 1.0, random_state: Optional[int] = None, epsilon: float = 0.0)[source]

Bases: obp.policy.base.BaseContextualPolicy

Linear Upper Confidence Bound.

Parameters
  • dim (int) – Number of dimensions of context vectors.

  • n_actions (int) – Number of actions.

  • len_list (int, default=1) – Length of the list of actions recommended in each impression. When Open Bandit Dataset is used, this should be set to 3.

  • batch_size (int, default=1) – Number of samples used in a batch parameter update.

  • random_state (int, default=None) – Controls the random seed in sampling actions.

  • epsilon (float, default=0.) – Exploration hyperparameter that must take a value in the range [0., 1.].

References

L. Li, W. Chu, J. Langford, and E. Schapire. A contextual-bandit approach to personalized news article recommendation. In Proceedings of the 19th International Conference on World Wide Web, pp. 661–670. ACM, 2010.
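
Example

A minimal instantiation sketch; the hyperparameter values below are illustrative assumptions.

    from obp.policy.linear import LinUCB

    # here `epsilon` is the exploration hyperparameter (a larger value
    # encourages more exploration), not an exploration probability
    lin_ucb = LinUCB(dim=5, n_actions=80, len_list=3, epsilon=0.05, random_state=12345)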

initialize() → None

Initialize policy parameters.

select_action(context: numpy.ndarray) → numpy.ndarray[source]

Select action for new data.

Parameters

context (array-like, shape (1, dim_context)) – Observed context vector.

Returns

selected_actions – List of selected actions.

Return type

array-like, shape (len_list, )
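
Conceptually, LinUCB scores each action by its estimated reward plus an uncertainty bonus at the current context. The sketch below is a common formulation in which epsilon scales the confidence width; theta_hat, A_inv, and the helper name are assumptions, not obp's internal API.

    import numpy as np

    def ucb_select(theta_hat, A_inv, context, len_list, epsilon):
        """Score each action by its mean estimate plus epsilon * sqrt(x^T A_inv_a x)
        and return the `len_list` highest-scoring actions."""
        x = context.flatten()
        means = theta_hat @ x                                   # shape: (n_actions,)
        widths = np.sqrt(np.einsum("i,aij,j->a", x, A_inv, x))  # per-action uncertainty
        return np.argsort(means + epsilon * widths)[::-1][:len_list]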

update_params(action: int, reward: float, context: numpy.ndarray) → None[source]

Update policy parameters.

Parameters
  • action (int) – Selected action by the policy.

  • reward (float) – Observed reward for the chosen action and position.

  • context (array-like, shape (1, dim_context)) – Observed context vector.

property policy_type

Type of the bandit policy.