This paper deals with bandit online learning problems with feedback of unknown delays, which can emerge in multi-armed bandit (MAB) and bandit convex optimization (BCO) settings. MAB and BCO require only values of the objective function, which become available through feedback and are used to estimate the gradient appearing in the corresponding iterative algorithms. Since the challenging case of feedback with \emph{unknown} delays prevents one from constructing the sought gradient estimates, existing MAB and BCO algorithms become intractable. For such challenging setups, delayed exploration, exploitation, and exponential weight (DEXP3) iterations, along with delayed bandit gradient descent (DBGD) iterations, are developed for MAB and BCO, respectively. Leveraging a unified analysis framework, it is established that the regrets of DEXP3 and DBGD are ${\cal O}\big( \sqrt{K\bar{d}(T+D)} \big)$ and ${\cal O}\big( \sqrt{K(T+D)} \big)$, respectively, where $\bar{d}$ denotes the maximum delay and $D$ the delay accumulated over $T$ slots. Numerical tests on both synthetic and real data validate the performance of DEXP3 and DBGD.
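To make the delayed-bandit setting concrete, the following is a minimal sketch, not the paper's exact DEXP3 algorithm, of an EXP3-style exponential-weights learner whose importance-weighted loss updates are buffered until the (unknowingly) delayed feedback arrives. The step size \texttt{eta}, exploration rate \texttt{gamma}, horizon, and the random delay model are illustrative placeholders rather than quantities specified in the paper.

\begin{verbatim}
# Illustrative sketch only: EXP3-style exponential weights with losses that
# arrive after unknown delays; eta, gamma, and the delay model are assumed.
import numpy as np

rng = np.random.default_rng(0)
K, T = 5, 2000                        # number of arms, time horizon
eta, gamma = 0.05, 0.01               # learning and exploration rates (placeholders)
mean_loss = rng.uniform(0.2, 0.8, K)  # unknown mean loss of each arm

w = np.ones(K)                        # exponential weights
pending = []                          # in-flight feedback: (arrival_slot, arm, prob, loss)

for t in range(T):
    # sampling distribution: weights mixed with uniform exploration
    p = (1 - gamma) * w / w.sum() + gamma / K
    arm = rng.choice(K, p=p)
    loss = float(rng.random() < mean_loss[arm])   # Bernoulli loss of the played arm
    delay = int(rng.integers(0, 50))              # delay is unknown to the learner
    pending.append((t + delay, arm, p[arm], loss))

    # apply every importance-weighted update whose feedback has arrived by slot t
    arrived = [f for f in pending if f[0] <= t]
    pending = [f for f in pending if f[0] > t]
    for _, a, pa, l in arrived:
        w[a] *= np.exp(-eta * l / pa)
    w /= w.max()                                  # rescale for numerical stability
\end{verbatim}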