Soft Actor Critic Training

Note

Type: Closed loop, learning based, model free

State/action space constraints: None

Optimal: Yes

Versatility: Swing-up and stabilization

Theory

The soft actor critic (SAC) algorithm is a reinforcement learning (RL) method. It belongs to the class of so-called ‘model-free’ methods, i.e. no knowledge about the system to be controlled is assumed. Instead, the controller is trained via interaction with the system, such that a (sub-)optimal mapping from state space to control command is learned. The learning process is guided by a reward function that encodes the task, similar to the usage of cost functions in optimal control.

SAC has two defining features. Firstly, the mapping from state space to control command is probabilistic. Secondly, the entropy of the control output is maximized alongside the expected return during training. In theory, this leads to robust controllers and reduces the probability of ending up in suboptimal local minima.
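
More precisely, SAC maximizes the expected return augmented by the entropy of the policy, i.e. the maximum entropy objective from [1], where \(\alpha\) is a temperature parameter weighting the entropy bonus against the reward:

\[J(\pi) = \sum_{t} \mathbb{E}_{(s_t, a_t) \sim \rho_\pi}\left[r(s_t, a_t) + \alpha \, \mathcal{H}\big(\pi(\cdot \mid s_t)\big)\right]\]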

For more information on SAC, please refer to the original paper [1].

API

The sac trainer can be initialized with:

trainer = sac_trainer(log_dir="log_data/sac_training")

During and after the training process, the trainer saves logging data as well as the best model in the directory provided via the log_dir parameter.

Before the training can be started, the following three initialisations have to be made:

trainer.init_pendulum()
trainer.init_environment()
trainer.init_agent()

Warning

Make sure to call these functions in this order, as they build upon one another.

The parameters of init_pendulum are:

  • mass: mass of the pendulum

  • length: length of the pendulum

  • inertia: inertia of the pendulum

  • damping: damping of the pendulum

  • coulomb_friction: Coulomb friction coefficient of the pendulum

  • gravity: gravitational acceleration

  • torque_limit: torque limit of the pendulum
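
A call could look like the following (the numerical values are illustrative placeholders, not recommended defaults):

trainer.init_pendulum(mass=0.57,
                      length=0.5,
                      inertia=0.15,
                      damping=0.15,
                      coulomb_friction=0.0,
                      gravity=9.81,
                      torque_limit=2.0)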

The parameters of init_environment are:

  • dt: timestep in seconds

  • integrator: which integrator to use (euler or runge_kutta)

  • max_steps: maximum number of timesteps the agent can take in one episode

  • reward_type: type of reward received from the environment (continuous, discrete, soft_binary, soft_binary_with_repellor or open_ai_gym)

  • target: the target state ([np.pi, 0] for swingup)

  • state_target_epsilon: the region around the target state that is still considered to be on target

  • random_init: how the pendulum state is initialized at the beginning of each episode (False, start_vicinity, everywhere)

  • state_representation: How to represent the state of the pendulum (should be 2 or 3). 2 for regular representation (position, velocity). 3 for trigonometric representation (cos(position), sin(position), velocity).
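
For a swing-up task, a configuration could look roughly like this (the values, and in particular the exact types expected for random_init and state_target_epsilon, are illustrative assumptions; np refers to numpy):

trainer.init_environment(dt=0.01,
                         integrator="runge_kutta",
                         max_steps=1000,
                         reward_type="soft_binary_with_repellor",
                         target=[np.pi, 0],
                         state_target_epsilon=[0.1, 0.1],
                         random_init="start_vicinity",
                         state_representation=2)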

The parameters of init_agent are:

  • learning_rate: learning rate of the agent

  • warm_start: whether to warm start the agent from a previously trained model

  • warm_start_path: path to a model to load for warm starting

  • verbose: Whether to print training information to the terminal
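
For instance (the values are illustrative; warm starting is disabled here, so warm_start_path is left empty):

trainer.init_agent(learning_rate=0.0003,
                   warm_start=False,
                   warm_start_path="",
                   verbose=1)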

After these initialisations, the training can be started with:

trainer.train(training_timesteps=1e6,
              reward_threshold=1000,
              eval_frequency=10000,
              n_eval_episodes=20,
              verbose=1)

where the parameters are:

  • training_timesteps: Number of timesteps to train

  • reward_threshold: Validation threshold to stop training early

  • eval_frequency: evaluate the model every eval_frequency timesteps

  • n_eval_episodes: number of evaluation episodes

  • verbose: Whether to print training information to the terminal

When finished, the train method will save the best model in the log_dir of the trainer object.

Warning

When training is started, the log_dir will be deleted. If you want to keep a previously trained model, move the saved files somewhere else first.

The training progress can be observed with tensorboard. Open a new terminal and start tensorboard with the corresponding log path, e.g.:

$> tensorboard --logdir log_data/sac_training/tb_logs

The default reward function used during training is soft_binary_with_repellor:

\[r = \exp\left(-\frac{(\theta - \pi)^2}{2 \cdot 0.25^2}\right) - \exp\left(-\frac{(\theta - 0)^2}{2 \cdot 0.25^2}\right)\]

This encourages moving away from the stable fixed point of the system at \(\theta = 0\) and spending most of the time at the target, the unstable fixed point at \(\theta = \pi\). Different reward functions can be used. Novel reward functions can be implemented by extending the swingup_reward method of the training environment with an appropriate if clause and then selecting the new reward function via the ‘reward_type’ parameter of init_environment, as sketched below. The training environment is located in gym_environment.
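
As a rough sketch (the attribute and variable names below, such as self.reward_type and theta, are assumptions for illustration and not the actual implementation), the if clause could look like this:

# hypothetical excerpt from swingup_reward in gym_environment
if self.reward_type == "soft_binary_with_repellor":
    reward = (np.exp(-(theta - np.pi)**2 / (2 * 0.25**2))
              - np.exp(-(theta - 0.0)**2 / (2 * 0.25**2)))
elif self.reward_type == "my_custom_reward":
    # e.g. a quadratic penalty on the distance to the upright position
    reward = -(theta - np.pi)**2

The new reward function would then be selected by passing reward_type="my_custom_reward" to init_environment.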

Usage

For an example of how to train a sac model see the train_sac.py script in the examples folder.

The trained model can be used with the sac controller.

Comments

Todo: comments on training convergence stability

Requirements

References

[1] [Haarnoja, Tuomas, et al. “Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor.” International conference on machine learning. PMLR, 2018.](https://arxiv.org/abs/1801.01290)