Soft Actor Critic Training
Note
Type: Closed loop, learning based, model free
State/action space constraints: None
Optimal: Yes
Versatility: Swing-up and stabilization
Theory
The soft actor critic (SAC) algorithm is a reinforcement learning (RL) method. It belongs to the class of so-called 'model free' methods, i.e. no knowledge about the system to be controlled is assumed. Instead, the controller is trained via interaction with the system, such that a (sub-)optimal mapping from state space to control command is learned. The learning process is guided by a reward function that encodes the task, similar to the use of cost functions in optimal control.
SAC has two defining features. Firstly, the mapping from state space to control command is probabilistic. Secondly, the entropy of the control output is maximized along with the reward during training. In theory, this leads to robust controllers and reduces the probability of ending up in suboptimal local minima.
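Concretely, SAC maximizes the expected return augmented with an entropy term (cf. [1]):
\[ J(\pi) = \sum_{t} \mathbb{E}_{(s_t, a_t) \sim \rho_\pi} \left[ r(s_t, a_t) + \alpha \, \mathcal{H}\big(\pi(\cdot \mid s_t)\big) \right] \]
where \(r\) is the reward, \(\pi\) the stochastic policy, \(\rho_\pi\) the state-action distribution induced by \(\pi\), \(\mathcal{H}\) the entropy, and \(\alpha\) a temperature parameter that trades off reward against entropy.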
For more information on SAC, please refer to the original paper [1].
API
The sac trainer can be initialized with:
trainer = sac_trainer(log_dir="log_data/sac_training")
During and after the training process, the trainer will save logging data as well as the best model in the directory provided via the log_dir parameter.
Before the training can be started, the following three initialisations have to be made:
trainer.init_pendulum()
trainer.init_environment()
trainer.init_agent()
Warning
Make sure to call these functions in this order, as they build upon one another.
The parameters of init_pendulum are (an example call is given after the list):
- mass: mass of the pendulum
- length: length of the pendulum
- inertia: inertia of the pendulum
- damping: damping of the pendulum
- coulomb_friction: Coulomb friction of the pendulum
- gravity: gravitational acceleration
- torque_limit: torque limit of the pendulum
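A possible call could look as follows; the keyword names are taken from the list above, while the concrete values are illustrative placeholders (assumed to be in SI units), not recommendations from this documentation:
trainer.init_pendulum(mass=0.57,
                      length=0.5,
                      inertia=0.15,
                      damping=0.15,
                      coulomb_friction=0.0,
                      gravity=9.81,
                      torque_limit=2.0)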
The parameters of init_environment are (a sketch of a call is given after the list):
- dt: timestep in seconds
- integrator: which integrator to use (euler or runge_kutta)
- max_steps: maximum number of timesteps the agent can take in one episode
- reward_type: type of reward to receive from the environment (continuous, discrete, soft_binary, soft_binary_with_repellor or open_ai_gym)
- target: the target state ([np.pi, 0] for swingup)
- state_target_epsilon: the region around the target which is considered as the target
- random_init: how the pendulum is initialized at the beginning of each episode (False, start_vicinity or everywhere)
- state_representation: how to represent the state of the pendulum (should be 2 or 3). 2 for the regular representation (position, velocity), 3 for the trigonometric representation (cos(position), sin(position), velocity).
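A sketch of a call using the keyword names from the list above; the values, as well as the exact types expected for random_init and state_target_epsilon, are assumptions for illustration:
import numpy as np

trainer.init_environment(dt=0.01,
                         integrator="runge_kutta",
                         max_steps=1000,
                         reward_type="soft_binary_with_repellor",
                         target=[np.pi, 0],
                         state_target_epsilon=[1e-2, 1e-2],
                         random_init="everywhere",
                         state_representation=2)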
The parameters of init_agent are (an example call is given after the list):
- learning_rate: learning rate of the agent
- warm_start: whether to warm start the agent
- warm_start_path: path to a model to load for warm starting
- verbose: whether to print training information to the terminal
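For example (the learning rate value is illustrative; warm starting is disabled here):
trainer.init_agent(learning_rate=3e-4,
                   warm_start=False,
                   warm_start_path="",
                   verbose=1)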
After these initialisations the training can be started with:
trainer.train(training_timesteps=1e6,
reward_threshold=1000,
eval_frequency=10000,
n_eval_episodes=20,
verbose=1)
where the parameters are:
- training_timesteps: number of timesteps to train
- reward_threshold: validation threshold to stop training early
- eval_frequency: evaluate the model every eval_frequency timesteps
- n_eval_episodes: number of evaluation episodes
- verbose: whether to print training information to the terminal
When finished, the train method will save the best model in the log_dir of the trainer object.
Warning
When training is started, the log_dir will be deleted. If you want to keep a previously trained model, move the saved files somewhere else before starting a new training run.
The training progress can be observed with tensorboard. Open a new terminal and launch tensorboard with the correct log path, e.g.:
$> tensorboard --logdir log_data/sac_training/tb_logs
The default reward function used during training is soft_binary_with_repellor. This encourages moving away from the stable fixed point of the system at \(\theta = 0\) and spending most of the time at the target, the unstable fixed point at \(\theta = \pi\). Different reward functions can be used. Novel reward functions can be implemented by modifying the swingup_reward method of the training environment with an appropriate if clause and then selecting this reward function in the init_environment parameters under the key 'reward_type'; a sketch of such a custom reward is given below. The training environment is located in gym_environment.
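A minimal sketch of a custom reward that could be wired into swingup_reward as an additional if clause. The state layout (position, velocity) and the target follow the init_environment parameters above; the function name and the branch name are hypothetical:
import numpy as np

def quadratic_swingup_reward(state, target=(np.pi, 0.0)):
    # hypothetical custom reward: negative quadratic distance to the target state
    # state = (position, velocity); closer to the target -> less negative reward
    pos_error = (state[0] - target[0]) ** 2
    vel_error = (state[1] - target[1]) ** 2
    return -(pos_error + 0.1 * vel_error)

Inside swingup_reward, this would correspond to an additional branch such as elif self.reward_type == "quadratic": ..., and the new reward would then be selected by passing reward_type="quadratic" to init_environment.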
Usage
For an example of how to train a SAC model, see the train_sac.py script in the examples folder.
The trained model can be used with the sac controller; alternatively, the saved policy can be loaded directly with Stable Baselines 3 as sketched below.
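A minimal loading sketch using the Stable Baselines 3 API; the exact file name of the saved best model inside log_dir is an assumption here and should be adapted to the files the trainer actually writes:
import numpy as np
from stable_baselines3 import SAC

# path is an assumption; check the contents of your log_dir
model = SAC.load("log_data/sac_training/best_model/best_model")

# query the deterministic policy for a single observation
# (the observation layout must match the chosen state_representation)
obs = np.array([np.pi, 0.0])
action, _ = model.predict(obs, deterministic=True)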
Requirements
Stable Baselines 3 (https://github.com/DLR-RM/stable-baselines3)
Numpy
PyYaml
References
[1] [Haarnoja, Tuomas, et al. “Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor.” International conference on machine learning. PMLR, 2018.](https://arxiv.org/abs/1801.01290)
Comments
Todo: comments on training convergence stability