Reinforcement Learning and Approximate Dynamic Programming for Feedback Control
Format: PDF / Kindle (mobi) / ePub
Reinforcement learning (RL) and adaptive dynamic programming (ADP) have been among the most critical research fields in science and engineering for modern complex systems. This book describes the latest RL and ADP techniques for decision and control in human-engineered systems, covering both single-player decision and control and multi-player games. Edited by the pioneers of RL and ADP research, the book brings together ideas and methods from many fields and provides important and timely guidance on controlling a wide variety of systems, such as robots, industrial processes, and economic decision-making.
opponents' actions and update their strategies in reaction to others' actions in a best-response fashion. Marden et al. propose a modified version of fictitious play, called joint fictitious play with inertia, for potential games, in which players alternate their updates at different time slots. In all these learning schemes, players have to monitor the actions of every other player and need to know their own payoff so as to find their optimal actions. In this chapter, we are interested in
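The best-response idea behind fictitious play can be sketched in a few lines. This is a minimal illustration of classical two-player fictitious play only, not the joint fictitious play with inertia of Marden et al.; the function name `fictitious_play`, the uniform prior counts, and the 2x2 coordination game are assumptions made for the example.

```python
import numpy as np

def fictitious_play(payoff_a, payoff_b, steps=200):
    """Classical two-player fictitious play (illustrative sketch).

    payoff_a[i, j] is player A's payoff when A plays i and B plays j;
    payoff_b[i, j] is player B's payoff for the same action pair.
    """
    n_a, n_b = payoff_a.shape
    # Each player keeps counts of the opponent's past actions.
    # Starting from ones is an assumed uniform prior.
    counts_a = np.ones(n_a)
    counts_b = np.ones(n_b)
    history = []
    for _ in range(steps):
        belief_about_a = counts_a / counts_a.sum()
        belief_about_b = counts_b / counts_b.sum()
        # Best response to the empirical frequency of the opponent's actions.
        action_a = int(np.argmax(payoff_a @ belief_about_b))
        action_b = int(np.argmax(belief_about_a @ payoff_b))
        counts_a[action_a] += 1
        counts_b[action_b] += 1
        history.append((action_a, action_b))
    return history, counts_a / counts_a.sum(), counts_b / counts_b.sum()

# Hypothetical coordination game: both players prefer to match on action 0.
game = np.array([[2.0, 0.0], [0.0, 1.0]])
history, freq_a, freq_b = fictitious_play(game, game)
```

Note that each player needs both the full history of the other player's actions and its own payoff matrix, which is the monitoring requirement the passage points out.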
very similar to the planning trees used here, and planning algorithms based on them were given, for example, by .
22.5 Numerical Example
To understand the behavior of OPD, OLOP, and OPSS in practice, we apply them to the problem of swinging up an underactuated inverted pendulum, a rather simple problem commonly used in the literature on solving MDPs. The inverted pendulum consists of a mass attached to an actuated link (a rod) that rotates in a vertical plane. The available power is
columns of the feature matrix that correspond to the two smallest components of the weight vector as follows. The column of the feature matrix corresponding to the smallest component of the weight vector is replaced by the most recent estimate of the value function (suitably normalized so that all its entries are between 0 and 1) obtained after running the TD scheme for a given large number of iterations using the feature matrix of the previous step, while the column corresponding to the second
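The first of the two column replacements described above might look like the following sketch. Several details are assumptions not stated in the excerpt: "smallest component" is read as smallest in magnitude, the normalization is taken to be min-max scaling, and the helper name `replace_weakest_feature` is hypothetical. The rule for the second-smallest component is cut off in the excerpt, so it is not implemented here.

```python
import numpy as np

def replace_weakest_feature(phi, weights, value_estimate):
    """Replace the feature column with the smallest-magnitude weight
    by a value-function estimate scaled into [0, 1].

    phi:            (n_states, n_features) feature matrix
    weights:        (n_features,) weight vector from the previous TD run
    value_estimate: (n_states,) most recent value-function estimate
    """
    v = np.asarray(value_estimate, dtype=float)
    span = v.max() - v.min()
    # Min-max scaling is one assumed way to map all entries into [0, 1].
    normalized = (v - v.min()) / span if span > 0 else np.zeros_like(v)
    # Assumption: "smallest component" means smallest absolute value.
    idx = int(np.argmin(np.abs(weights)))
    new_phi = np.array(phi, dtype=float)  # copy, leave the input untouched
    new_phi[:, idx] = normalized
    return new_phi, idx

phi = np.ones((3, 3))
weights = np.array([0.5, -0.1, 2.0])
new_phi, replaced = replace_weakest_feature(phi, weights, [2.0, 4.0, 6.0])
```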
then the ACOE is solved, where η* = hr(x •). The successful approaches to this problem consider not h*, but the function of two variables, (24.41) Watkins introduced a reinforcement learning technique for computation of H* in his thesis, and a complete convergence proof appeared later in . An elementary proof based on an associated “fluid limit model” is contained in . Unfortunately, this approach depends critically on a finite state space, a finite action space, and a complete
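For orientation, here is a minimal sketch of tabular Q-learning in the finite state and action setting the passage refers to. It uses the familiar discounted-cost form of Watkins' update rather than the average-cost function H* discussed in the chapter, and the three-state chain, the costs, the step size, and the purely random exploration are all assumptions chosen for illustration.

```python
import numpy as np

def q_learning(n_steps=5000, gamma=0.9, alpha=0.5, seed=0):
    """Tabular Q-learning on a hypothetical 3-state chain (cost minimization).

    Action 0 moves left, action 1 moves right, with the walls reflecting.
    Every step costs 1 except steps taken from the rightmost state.
    """
    rng = np.random.default_rng(seed)
    n_states, n_actions = 3, 2
    Q = np.zeros((n_states, n_actions))
    x = 0
    for _ in range(n_steps):
        # Uniformly random exploration so every state-action pair is visited.
        u = int(rng.integers(n_actions))
        step = 1 if u == 1 else -1
        x_next = min(max(x + step, 0), n_states - 1)
        cost = 0.0 if x == n_states - 1 else 1.0
        # Move Q(x, u) toward the one-step cost plus the discounted
        # minimum over actions at the next state.
        target = cost + gamma * Q[x_next].min()
        Q[x, u] += alpha * (target - Q[x, u])
        x = x_next
    return Q

Q = q_learning()
policy = Q.argmin(axis=1)  # greedy (cost-minimizing) action in each state
```

The table `Q` has one entry per state-action pair, which is why this style of algorithm relies on finite state and action spaces.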
cart-pole system from its naïve state after the above four initial failures. Note that different sets of initial weights used in the action and the critic networks, as well as different learning parameters used in the training algorithms, result in different controllers. Simulation results show that, for every run initialized from random weights in the networks, the direct HDP controller was always able to learn to balance the cart-pole within 50 trials.
Figure 9.2 Trajectories