Deep Deterministic Policy Gradient Training
Note
Type: Closed loop, learning based, model free
State/action space constraints: None
Optimal: Yes
Versatility: Swing-up and stabilization
Theory
The Deep Deterministic Policy Gradient (DDPG) algorithm is a reinforcement learning (RL) method. DDPG is a model-free, off-policy algorithm. The controller is trained via interaction with the system, such that a (sub-)optimal mapping from state space to control command is learned. The learning process is guided by a reward function that encodes the task, similar to the usage of cost functions in optimal control.
DDPG can be thought of as Q-learning for continuous action spaces. It utilizes two networks, an actor and a critic. The actor receives a state and proposes an action. The critic assigns a value to a state-action pair; the better an action suits a state, the higher the value.
Further, DDPG makes use of target networks in order to stabilize the training. This means there are a training version and a target version of both the actor and the critic model. The training versions are updated during training, while the target networks are only partially updated towards them by polyak averaging of the network weights \(\theta\):

\[\theta_{\text{target}} \leftarrow \tau\,\theta + (1 - \tau)\,\theta_{\text{target}}\]

where \(\tau\) is usually small.
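As a minimal sketch (the function below is illustrative and not part of the trainer's API, and tau=0.005 is just an example value), a polyak update of two Keras models could look like this:

def polyak_update(target_model, train_model, tau=0.005):
    # target_weights <- tau * train_weights + (1 - tau) * target_weights
    new_weights = [
        tau * w_train + (1.0 - tau) * w_target
        for w_train, w_target in zip(train_model.get_weights(),
                                     target_model.get_weights())
    ]
    target_model.set_weights(new_weights)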
DDPG also makes use of a replay buffer, which is a set of experiences observed during training. The replay buffer should be large enough to contain a wide range of experiences. For more information on DDPG, please refer to the original paper [1].
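Purely for illustration (this is not the trainer's internal implementation), a replay buffer can be sketched as a bounded collection of transitions from which random minibatches are drawn:

import random
from collections import deque

class ReplayBuffer:
    def __init__(self, max_size=50000):
        # oldest experiences are discarded once the buffer is full
        self.buffer = deque(maxlen=max_size)

    def store(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size=64):
        # random minibatch of past transitions for one training step
        return random.sample(self.buffer, batch_size)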
This implementation loosely follows the Keras guide [2].
API
The ddpg trainer can be initialized with:
trainer = ddpg_trainer(batch_size=64,
                       validate_every=20,
                       validation_reps=10,
                       train_every_steps=1)
where the parameters are:

batch_size
: number of samples to train on in one training step

validate_every
: evaluate the training progress every validate_every episodes

validation_reps
: number of episodes used during the evaluation

train_every_steps
: frequency of training steps relative to the steps taken in the environment
Before the training can be started, the following three initialisations have to be made:
trainer.init_pendulum()
trainer.init_environment()
trainer.init_agent()
Warning
Attention: Make sure to call these functions in this order, as they build upon one another.
The parameters of init_pendulum are listed below, followed by an illustrative call:
mass
: mass of the pendulum

length
: length of the pendulum

inertia
: inertia of the pendulum

damping
: damping of the pendulum

coulomb_friction
: coulomb friction of the pendulum

gravity
: gravitational acceleration

torque_limit
: torque limit of the pendulum
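An illustrative call could look like this; the numerical values are placeholders, not recommended settings:

trainer.init_pendulum(mass=0.57,
                      length=0.5,
                      inertia=0.15,
                      damping=0.15,
                      coulomb_friction=0.0,
                      gravity=9.81,
                      torque_limit=1.5)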
The parameters of init_environment are listed below, followed by an illustrative call:
dt
: timestep in seconds

integrator
: which integrator to use ("euler" or "runge_kutta")

max_steps
: maximum number of timesteps the agent can take in one episode

reward_type
: type of reward to receive from the environment ("continuous", "discrete", "soft_binary", "soft_binary_with_repellor" or "open_ai_gym")

target
: the target state ([np.pi, 0] for swingup)

state_target_epsilon
: the region around the target which is considered as target

random_init
: how the pendulum is initialized at the beginning of each episode ("False", "start_vicinity" or "everywhere")

state_representation
: how to represent the state of the pendulum (2 or 3): 2 for the regular representation (position, velocity), 3 for the trigonometric representation (cos(position), sin(position), velocity)

validation_limit
: validation threshold to stop training early
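An illustrative call with placeholder values (chosen only to show the expected argument types) could be:

import numpy as np

trainer.init_environment(dt=0.01,
                         integrator="runge_kutta",
                         max_steps=1000,
                         reward_type="open_ai_gym",
                         target=[np.pi, 0],
                         state_target_epsilon=[0.1, 0.1],
                         random_init="False",
                         state_representation=2,
                         validation_limit=-150)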
The parameters of init_agent are listed below, followed by an illustrative call:
replay_buffer_size
: the size of the replay buffer

actor
: the tensorflow model to use as actor during training. If None, the default model is used

critic
: the tensorflow model to use as critic during training. If None, the default model is used

discount
: the discount factor used to propagate the reward during one episode

actor_lr
: learning rate for the actor model

critic_lr
: learning rate for the critic model

tau
: determines how much of the training models is copied to the target models (see polyak averaging above)
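Again, an illustrative call with placeholder values:

trainer.init_agent(replay_buffer_size=50000,
                   actor=None,       # None selects the default actor model
                   critic=None,      # None selects the default critic model
                   discount=0.99,
                   actor_lr=0.001,
                   critic_lr=0.002,
                   tau=0.005)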
After these initialisations, the training can be started with:

trainer.train(n_episodes=1000,
              verbose=True)
where the parameters are:
n_episodes
: number of episodes to train

verbose
: whether to print training information to the terminal
Afterwards the trained model can be saved with:
trainer.save("log_data/ddpg_training")
The save method will save the actor and critic model under the given path.
Warning
Attention: This will delete existing models at this location. So if you want to keep a trained model, move the saved files somewhere else.
The default reward function used during training is open_ai_gym. It encourages spending most of the time at the target state ([np.pi, 0] for the swingup task) while keeping the actions as small as possible. Different reward functions can be used by changing the reward type in init_environment. Novel reward functions can be implemented by extending the swingup_reward method of the training environment with an appropriate if clause and then selecting this reward function via the 'reward_type' parameter of init_environment. The training environment is located in gym_environment.
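As a sketch of what a custom reward could compute (the function name, the weights and the state/action handling below are hypothetical; in the actual environment this logic would be added as another branch inside swingup_reward and selected via reward_type):

import numpy as np

def quadratic_swingup_reward(state, action, target=(np.pi, 0.0)):
    # hypothetical reward: quadratic penalty on the distance to the
    # target state and on the applied torque (weights are placeholders)
    pos_error = target[0] - state[0]
    vel_error = target[1] - state[1]
    return -(pos_error**2 + 0.1 * vel_error**2 + 0.001 * float(action)**2)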
Usage
For an example of how to train a DDPG model, see the train_ddpg.py script in the examples folder.
The trained model can be used with the ddpg controller.
Requirements
TensorFlow 2.x
References
[1] Lillicrap, Timothy P., et al. “Continuous control with deep reinforcement learning.” arXiv preprint arXiv:1509.02971 (2015).
[2] Keras guide: Deep Deterministic Policy Gradient (DDPG), https://keras.io/examples/rl/ddpg_pendulum/
Comments
Todo: comments on training convergence stability