TY - GEN
T1 - An Exploration-free Method for a Linear Stochastic Bandit Driven by a Linear Gaussian Dynamical System
AU - Gornet, Jonathan
AU - Mo, Yilin
AU - Sinopoli, Bruno
N1 - Publisher Copyright:
© 2025 IEEE.
PY - 2025
Y1 - 2025
AB - In stochastic multi-armed bandits, a major challenge the learner faces is the trade-off between exploration and exploitation. Recently, exploration-free methods, which commit to the action predicted to return the highest reward, have been studied from the perspective of linear bandits. In this paper, we introduce a linear bandit setting where the reward is the output of a linear Gaussian dynamical system. Motivated by a problem encountered in hyperparameter optimization for reinforcement learning, where the number of actions is much higher than the number of training iterations, we propose Kalman filter Observability Dependent Exploration (KODE), an exploration-free method that utilizes Kalman filter predictions to select actions. The major contribution of this work is the discovery that the performance of the proposed method depends on the observability properties of the underlying linear Gaussian dynamical system. We evaluate KODE via two metrics: regret, the cumulative expected difference between the highest possible reward and the reward sampled by KODE, and action alignment, which measures how closely KODE's chosen action aligns with the state variable of the linear Gaussian dynamical system. To provide intuition for this performance, we prove that KODE implicitly encourages the learner to explore actions depending on the observability of the linear Gaussian dynamical system. KODE is compared with several well-known stochastic multi-armed bandit algorithms to validate our theoretical results.
UR - https://www.scopus.com/pages/publications/105031887178
DO - 10.1109/CDC57313.2025.11312563
M3 - Conference contribution
AN - SCOPUS:105031887178
T3 - Proceedings of the IEEE Conference on Decision and Control
SP - 5493
EP - 5500
BT - 2025 IEEE 64th Conference on Decision and Control, CDC 2025
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 64th IEEE Conference on Decision and Control, CDC 2025
Y2 - 9 December 2025 through 12 December 2025
ER -