Yosh


About Action Space and Continuous Actions in Reinforcement Learning (Bonus 4/5)

In my previous post, I explained why I capped my AI to 20 actions per second in the A01 video. But action frequency wasn’t the only restriction: I also limited the AI to a fixed set of steering values. This might seem strange, since Trackmania allows for an almost continuous steering range, and many top players use a gamepad joystick to take advantage of that. So again, why this restriction? Here are some further explanations below.

When training my AI with Reinforcement Learning (RL), I'm currently using a discrete action space. This means that at each time step, the AI chooses from a finite list of possible actions, called the action space. For example, in Trackmania this action space might include “press nothing,” “accelerate,” “accelerate + full right,” “accelerate + full left + brake,” “brake,” etc.

Accelerating and braking are binary in Trackmania: they’re either fully on or fully off. Steering, on the other hand, depends on the input device. With a keyboard, you can only steer fully right (steer = 1), fully left (steer = –1), or not steer at all (steer = 0). That gives us 2 brake options × 2 accelerate options × 3 steering options = 12 possible actions.

But with a joystick, steering is almost continuous, since Trackmania supports 131,073 different steering values in the range [–1., 1.]. Combined with the brake and accelerate options, it results in a total of 2 × 2 × 131,073 = 524,292 possible actions.
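The two counts above can be checked with a few lines of Python. This is only a sketch of the arithmetic: the constant names are made up for the example and are not taken from my training code.

```python
from itertools import product

# Binary inputs: each one is either off (False) or on (True).
BRAKE = (False, True)
ACCELERATE = (False, True)

# Keyboard steering: full left, no steering, full right.
KEYBOARD_STEER = (-1.0, 0.0, 1.0)

# Every (brake, accelerate, steer) combination is one discrete action.
keyboard_actions = list(product(BRAKE, ACCELERATE, KEYBOARD_STEER))
n_keyboard = len(keyboard_actions)

# With a joystick, only the number of steering values changes.
N_JOYSTICK_STEER = 131_073
n_joystick = len(BRAKE) * len(ACCELERATE) * N_JOYSTICK_STEER

print(n_keyboard, n_joystick)
```

The point is that the size of the action space is a product, so every extra option on one input multiplies the total.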

The problem is that in practice, RL tends to struggle with such large discrete action spaces. The AI has to predict which action is best at every time step, and this gets harder as the number of possibilities explodes. Moreover, most of these actions are extremely similar, which makes training even more confusing.

For this reason, in the A01 video I limited the AI to just 13 steering values:
[–1.0, –0.875, –0.75, –0.50, –0.25, –0.125, 0.0, 0.125, 0.25, 0.50, 0.75, 0.875, 1.0].
That gives an action space of 2 × 2 × 13 = 52 actions, which I found to be a good trade-off between driving precision and training feasibility.

However, I still faced another problem with braking. To initiate speed-drifts optimally, the AI needs very short brake taps of 0.01s (one frame). But as I explained in my previous post, the AI can only choose an action every 0.05s. So with the action space above, it would always brake for 0.05s, which is not optimal.

To fix this, I added an additional braking option: the AI can now brake for 0.05s, brake for 0.01s, or not brake at all. This increases the action space to 2 × 3 × 13 = 78 actions.
You can easily see the issue here: even small tweaks to the action rules can quickly increase the size of the action space. From what I’ve read, an action space of size 78 is already getting quite large and could cause instability during RL training.
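Here is a minimal sketch of how such an action space could be built, and how one chosen action could be expanded into the five 0.01s frames of a 0.05s step. The labels "none", "tap" and "hold", the function name, and the choice to put the tap on the first frame are assumptions for the example, not a description of my actual implementation.

```python
from itertools import product

STEER_VALUES = (-1.0, -0.875, -0.75, -0.50, -0.25, -0.125, 0.0,
                0.125, 0.25, 0.50, 0.75, 0.875, 1.0)
ACCELERATE = (False, True)
BRAKE_MODES = ("none", "tap", "hold")  # hypothetical labels
FRAMES_PER_STEP = 5  # one decision every 0.05 s, one frame every 0.01 s

# 2 accelerate options x 3 brake options x 13 steering values.
ACTIONS = list(product(ACCELERATE, BRAKE_MODES, STEER_VALUES))


def expand_to_frames(action):
    """Turn one action into per-frame (accelerate, brake, steer) inputs.

    Assumption: a "tap" brakes on the first frame of the step only.
    """
    accelerate, brake_mode, steer = action
    frames = []
    for i in range(FRAMES_PER_STEP):
        brake = brake_mode == "hold" or (brake_mode == "tap" and i == 0)
        frames.append((accelerate, brake, steer))
    return frames


print(len(ACTIONS))
print(expand_to_frames((True, "tap", 0.25)))
```

Without the "tap" option, the same product gives 2 × 2 × 13 = 52 actions, which is the space described before the braking fix.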

So using a discrete action space might not be the best option for Trackmania. Instead, it might be better to use a continuous action space.


Currently, I’m using an RL algorithm based on Deep Q-Learning (DQN), which only supports discrete actions. But in the past, I’ve also used algorithms like Proximal Policy Optimization (PPO) and Soft Actor-Critic (SAC), which are designed for continuous actions. For example, in my pipe video, the AI used continuous steering values. But for some reason (which I tried to analyse in this post), DQN-based methods have worked better for me in Trackmania overall.

With continuous actions (e.g. SAC, PPO), the AI predicts a value directly in the range [–1., 1.] for steering, instead of choosing from a fixed set. This is quite different from DQN, which assigns a “score” (expected cumulative reward) to each discrete steering option and picks the best one. But predicting a single value can be problematic in some cases.
Imagine the car is heading straight for an obstacle. The two best actions might be “steer full left” (–1.0) or “steer full right” (+1.0), while “don’t steer” (0.) is very bad. In this case, what value should be predicted? With a continuous algorithm, the AI might end up predicting the average between the two good options, that is steer = 0., which is actually the worst possible choice.
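The obstacle example can be written down as a toy calculation. The scores below are invented for illustration, and this is not a simulation of SAC or PPO. It only contrasts "pick the best-scored option" with "predict one value between two good options".

```python
# Hypothetical scores for the obstacle scenario: steering fully left or
# fully right is good, going straight is very bad.
steer_options = [-1.0, 0.0, 1.0]
q_values = [10.0, -50.0, 10.0]  # made-up numbers

# DQN-style choice: take the option with the highest score.
best_index = max(range(len(q_values)), key=q_values.__getitem__)
dqn_steer = steer_options[best_index]

# Averaging failure mode: a single predicted value that sits between the
# two good options lands exactly on the bad one.
good_actions = [-1.0, 1.0]
averaged_steer = sum(good_actions) / len(good_actions)

print(dqn_steer, averaged_steer)
```

The discrete choice always lands on one of the two good options, while the averaged value is the "don't steer" action.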

Another issue: in Trackmania, accelerating and braking are strictly binary. They can’t be continuous. To work around this with PPO or SAC, I treated them as continuous anyway and then converted them to binary: if the predicted value was above 0, the AI would brake; if it was below 0, it wouldn’t. The problem is that this design isn’t smooth: the AI can’t tell the difference between brake = 0.01 and brake = 0.99, but it sees a huge difference between brake = 0.01 and brake = –0.01, which might not be ideal.
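A minimal sketch of that thresholding, with a hypothetical function name, makes the discontinuity easy to see:

```python
def to_binary(raw_value: float) -> bool:
    """Convert a continuous prediction to an on/off input (on if above 0)."""
    return raw_value > 0.0


# Two very different predictions give the same input...
print(to_binary(0.01), to_binary(0.99))
# ...while two nearly identical predictions give opposite inputs.
print(to_binary(0.01), to_binary(-0.01))
```

Everything above 0 collapses to the same "brake" input, so the size of the predicted value carries no information once it has crossed the threshold.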

For these reasons, the best solution for Trackmania might be a mixed action space: continuous for steering, discrete for brake and accelerate. Unfortunately, this kind of setup isn’t very common in RL, and it’s trickier to implement.
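To give a rough idea of what a mixed action could look like, here is a small sketch using only the Python standard library. Everything in it is an assumption made for illustration: the `MixedAction` class, the function name, and the idea of sampling steering from a Gaussian squashed by tanh while sampling brake and accelerate from independent on/off probabilities. It shows the shape of the action, not a training algorithm.

```python
import math
import random
from dataclasses import dataclass


@dataclass
class MixedAction:
    steer: float      # continuous, in [-1, 1]
    accelerate: bool  # discrete, on/off
    brake: bool       # discrete, on/off


def sample_mixed_action(steer_mean, steer_std, p_accelerate, p_brake, rng):
    """Sample one action from hypothetical policy outputs.

    Steering is drawn from a Gaussian and squashed into [-1, 1] with tanh.
    Accelerate and brake are each drawn from their own on/off probability.
    """
    steer = math.tanh(rng.gauss(steer_mean, steer_std))
    accelerate = rng.random() < p_accelerate
    brake = rng.random() < p_brake
    return MixedAction(steer, accelerate, brake)


rng = random.Random(0)
print(sample_mixed_action(0.3, 0.1, p_accelerate=0.95, p_brake=0.05, rng=rng))
```

The tricky part that this sketch leaves out is the learning side: the algorithm has to handle a continuous output and discrete outputs in the same update.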

Anyway, I hope you found this post interesting! Sorry if there are some inaccuracies or grey areas in my explanations. Next time, I’ll be back with a less technical bonus post about the A01 video :)


Comments

Thanks for shedding more light on the approach/details. I'm toying with navigation/pathfinding in 3d spaces via RL and it's nice to know how far you can potentially take DQN.

Dawid Pogorzelski

Very interesting!

aro

