Abstract
Reinforcement Learning from Human Feedback (RLHF) has achieved impressive empirical successes while relying on a small amount of human feedback. However, there is limited theoretical justification for this phenomenon. Additionally, most recent studies focus on value-based algorithms despite the empirical successes of policy-based algorithms. In this work, we consider an RLHF algorithm based on policy optimization (PO-RLHF). The algorithm builds on the popular Policy Cover-Policy Gradient (PC-PG) algorithm, which assumes knowledge of the reward function. In PO-RLHF, knowledge of the reward function is not assumed, and the algorithm uses trajectory-based comparison feedback to infer the reward function. We provide performance bounds for PO-RLHF with low query complexity, which offers insight into why a small amount of human feedback may be sufficient to achieve good performance with RLHF. A key novelty is a trajectory-level elliptical potential analysis, which bounds the reward estimation error when comparison feedback (rather than numerical reward observations) is given. We provide and analyze the algorithms PG-RLHF and NN-PG-RLHF for two settings: linear and neural function approximation, respectively.
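The sketch below is a minimal, illustrative toy example (not the paper's algorithm or analysis) of the two ingredients the abstract describes under linear function approximation: fitting a reward parameter from trajectory-level comparison feedback via a Bradley-Terry / logistic model on trajectory feature differences, and then running a policy-gradient step that uses the fitted reward in place of observed rewards. All concrete choices here (feature dimension `d`, the feature map, horizons, step sizes, the bandit-style softmax policy) are assumptions made for illustration only.

```python
# Hedged sketch of comparison-based reward learning + policy gradient.
# Names and constants below are illustrative assumptions, not the paper's setup.
import numpy as np

rng = np.random.default_rng(0)
d = 4                              # assumed feature dimension
theta_true = rng.normal(size=d)    # unknown reward parameter, used only to simulate feedback

def traj_features(traj):
    """Sum of per-step features phi(s_h, a_h) over a trajectory."""
    return np.sum(traj, axis=0)

def sample_traj(horizon=5):
    # Stand-in for a rollout's per-step feature vectors phi(s_h, a_h).
    return rng.normal(size=(horizon, d))

def pref_prob(t1, t2, theta):
    # Bradley-Terry model: P(tau1 preferred over tau2)
    #   = sigmoid(<theta, phi(tau1) - phi(tau2)>)
    return 1.0 / (1.0 + np.exp(-(traj_features(t1) - traj_features(t2)) @ theta))

# Simulated trajectory pairs and binary comparison labels.
pairs = [(sample_traj(), sample_traj()) for _ in range(200)]
labels = [float(rng.random() < pref_prob(t1, t2, theta_true)) for t1, t2 in pairs]

# (1) Reward estimation: logistic-regression MLE on trajectory feature differences.
theta_hat = np.zeros(d)
lr = 0.1
for _ in range(500):
    grad = np.zeros(d)
    for (t1, t2), y in zip(pairs, labels):
        x = traj_features(t1) - traj_features(t2)
        p = 1.0 / (1.0 + np.exp(-x @ theta_hat))
        grad += (p - y) * x        # gradient of the negative log-likelihood
    theta_hat -= lr * grad / len(pairs)

# (2) Policy gradient with the *estimated* reward r_hat(a) = <theta_hat, phi(a)>,
# on a toy softmax policy over K actions (a bandit-style stand-in for an MDP).
K = 3
phis = rng.normal(size=(K, d))
w = np.zeros(K)
for _ in range(200):
    probs = np.exp(w - w.max()); probs /= probs.sum()
    r_hat = phis @ theta_hat
    # Exact policy gradient of E_pi[r_hat]: pi(a) * (r_hat(a) - E_pi[r_hat]).
    w += 0.5 * probs * (r_hat - probs @ r_hat)

print("estimated reward parameter:", np.round(theta_hat, 2))
print("final policy probabilities:", np.round(np.exp(w - w.max()) / np.exp(w - w.max()).sum(), 2))
```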
Original language | English |
---|---|
Pages (from-to) | 11830-11887 |
Number of pages | 58 |
Journal | Proceedings of Machine Learning Research |
Volume | 235 |
State | Published - 2024 |
Externally published | Yes |
Event | 41st International Conference on Machine Learning, ICML 2024 - Vienna, Austria |
Duration | 21 Jul 2024 → 27 Jul 2024 |
ASJC Scopus subject areas
- Artificial Intelligence
- Software
- Control and Systems Engineering
- Statistics and Probability