Yosh

About Action Frequency in Reinforcement Learning (Bonus 3/5)

In the A01 video, I explained that my AI can take 20 actions per second, using only a limited set of steering values. This is quite restrictive, since Trackmania allows up to 100 actions per second, and an almost continuous steering range. So yes, in theory, it's quite obvious the AI could reach higher performance without these limits...

So why impose them? Here is a more detailed explanation! To start, let's look at the topic of action frequency.

In reinforcement learning (RL), an agent (in this case, my AI) interacts with its environment (Trackmania) in discrete time-steps. At each time-step, the AI gets a reward that depends on the action it took in the previous step. The challenge for the AI is to predict which action will maximize rewards in the future: not just immediately, but over the course of an entire run.
In other words, the AI has to figure out how a short-term decision (like holding a specific steering angle for one time-step) will affect long-term success (the total rewards it accumulates over the next few seconds). That's the tricky part.

The difficulty of this prediction is closely tied to the time-step duration (which determines the action frequency). If we make the time-steps shorter, each individual action has a lower impact on the long-term outcome. As a result, it becomes harder for the AI to predict the benefit of a single action.

To make things easier, we can adjust another parameter: the time horizon. In RL, the agent doesn't try to predict consequences infinitely far into the future. Instead, it looks ahead only a limited number of time-steps. (More precisely, RL uses a discount factor γ to compute what's called a discounted cumulative reward. Roughly speaking, 1/(1-γ) time-steps corresponds to the effective time horizon.)
A shorter time horizon makes each action appear more significant, which helps the AI predict the best move for a given step. But if the horizon is too short, the agent becomes unable to take long-term consequences into account, which can be very problematic in a game like Trackmania.
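To make the relation between γ, the time-step and the horizon concrete, here is a minimal Python sketch. The function names and the 5-second horizon are made up for illustration; they are not values from the actual training setup.

```python
def effective_horizon_steps(gamma: float) -> float:
    """Rough effective horizon, in time-steps: 1 / (1 - gamma)."""
    if not 0.0 <= gamma < 1.0:
        raise ValueError("gamma must be in [0, 1)")
    return 1.0 / (1.0 - gamma)


def gamma_for_horizon(horizon_seconds: float, step_seconds: float) -> float:
    """Discount factor whose effective horizon is roughly horizon_seconds."""
    steps = horizon_seconds / step_seconds
    if steps < 1.0:
        raise ValueError("the horizon must cover at least one time-step")
    return 1.0 - 1.0 / steps


def discounted_return(rewards, gamma: float) -> float:
    """Discounted cumulative reward: r0 + gamma*r1 + gamma^2*r2 + ..."""
    total = 0.0
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


if __name__ == "__main__":
    # Hypothetical 5-second horizon, at 20 and at 100 actions per second.
    for step_seconds in (0.05, 0.01):
        gamma = gamma_for_horizon(5.0, step_seconds)
        steps = effective_horizon_steps(gamma)
        print(f"step={step_seconds}s  gamma={gamma:.4f}  horizon~{steps:.0f} steps")
```

For a fixed horizon in seconds, shortening the time-step pushes γ closer to 1, because the same few seconds now span many more steps.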

To sum up, there’s a trade-off between the AI’s potential performance and the practicality of training it. This balance depends on the ratio between time-step duration and time horizon.

With short time-steps and a long horizon, the AI can control the car with high precision while also taking long-term consequences into account. The common downside is that each individual action has only a tiny influence over the whole horizon, which makes predictions extremely difficult, and training becomes much harder in practice.
On the other hand, if we increase the time-step duration (lowering action frequency) and/or shorten the horizon, the predictions become simpler and training is easier. But this comes at a cost: the AI loses precise control of the car and/or the ability to plan for the long term.
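A quick way to see this trade-off is to count how many decisions fit inside the horizon. The sketch below uses a hypothetical 5-second horizon (not a value from the actual setup) and treats one step's "share" as simply step duration divided by horizon, which is only a rough proxy for how much a single action matters.

```python
def steps_in_horizon(horizon_seconds: float, step_seconds: float) -> int:
    """Number of decisions the agent makes within one horizon."""
    return round(horizon_seconds / step_seconds)


def single_step_share(horizon_seconds: float, step_seconds: float) -> float:
    """Rough fraction of the horizon covered by a single action."""
    return step_seconds / horizon_seconds


if __name__ == "__main__":
    HORIZON_SECONDS = 5.0  # hypothetical value, for illustration only
    for step_seconds in (0.05, 0.01):  # 20 and 100 actions per second
        n = steps_in_horizon(HORIZON_SECONDS, step_seconds)
        share = single_step_share(HORIZON_SECONDS, step_seconds)
        print(f"step={step_seconds}s -> {n} steps, one step ~ {share:.2%} of the horizon")
```

Going from 0.05s to 0.01s steps at the same horizon means five times more decisions, each covering a fifth as much of what the AI is trying to predict.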

This is a pretty common challenge in RL. In practice, training tends to become unstable and produce bad results if the ratio of time-step duration to time horizon gets too low. I haven't found a solid workaround yet.
The only time I managed to train the AI successfully at 100 actions per second was in my video about the noseboost trick (https://www.youtube.com/watch?v=NUl6QikjR04). In that case, a very short time-step (0.01s) was mandatory for the trick to work. The good news is that during noseboosts, everything happens so fast that the AI only needs to plan a couple of seconds ahead: 1-3 seconds is enough. So I was able to reduce the time-horizon a lot, which made it possible to increase the action frequency to the maximum.
But on A01 it's different: it's all about gradually building up speed to save time in the long run. So I couldn't decrease the time-horizon too much, and as a result I had to cap the AI at 20 actions per second to get good results.

On this subject, the biggest problem on A01 was dealing with the brake. To initiate each speed-drift, the car needs a quick brake tap. The optimal strategy is to brake for as short a time as possible (0.01s) so you don't lose too much speed. But with a time-step of 0.05s, the AI is forced to brake for at least 0.05s, which is bad.
To fix this, I added an additional choice to the set of actions from which the AI can choose. When the AI wants to brake, it can now choose between braking for the full time-step, or braking for only 0.01s (in which case the brake is automatically released on the next 0.01s frame).
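Here is a minimal sketch of how such a "brake tap" option could be expanded into per-frame inputs. All names (Inputs, expand_action, brake_mode) are hypothetical, and placing the tap on the first frame of the time-step is an assumption; this is not the actual implementation.

```python
from typing import List, NamedTuple

FRAMES_PER_STEP = 5  # one 0.05s time-step = five 0.01s game frames


class Inputs(NamedTuple):
    """Inputs sent to the game for a single 0.01s frame (hypothetical type)."""
    accelerate: bool
    brake: bool
    steer: float


def expand_action(accelerate: bool, steer: float, brake_mode: str) -> List[Inputs]:
    """Turn one agent action into the five frames of its time-step.

    brake_mode is "none", "full" (brake for the whole step) or
    "tap" (brake on the first frame only, then release).
    """
    if brake_mode not in ("none", "full", "tap"):
        raise ValueError(f"unknown brake_mode: {brake_mode!r}")
    frames = []
    for frame in range(FRAMES_PER_STEP):
        if brake_mode == "full":
            brake = True
        elif brake_mode == "tap":
            brake = frame == 0  # assumption: the tap starts the time-step
        else:
            brake = False
        frames.append(Inputs(accelerate, brake, steer))
    return frames
```

With this layout, "tap" costs the agent one extra choice per braking action, while acceleration and steering stay constant for the whole time-step.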
This works better, but it's still not perfect: the timing of the tap is still restricted. For example, in the downhill section, the TAS makes a 0.01s brake tap exactly at 7.82s into the run. Since the AI can only choose actions every 0.05s, it has to do it at either 7.80s or 7.85s, which might be suboptimal. Here again, I could keep giving the AI more finely-tuned action options to get around these limits. But that leads to a new problem: the size of the action space.
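The 7.82s example comes down to simple rounding. The sketch below works in hundredths of a second (game frames) to avoid floating-point surprises; the function name is made up for illustration.

```python
def nearest_decision_times(target_cs: int, step_cs: int = 5):
    """Decision times (in 0.01s frames) just before and just after a target.

    target_cs and step_cs are integers in hundredths of a second.
    """
    earlier = (target_cs // step_cs) * step_cs
    later = earlier if earlier == target_cs else earlier + step_cs
    return earlier, later


if __name__ == "__main__":
    earlier, later = nearest_decision_times(782)  # the TAS tap at 7.82s
    print(f"AI can tap at {earlier / 100:.2f}s or {later / 100:.2f}s")
```

In this example the tap lands either 0.02s early or 0.03s late compared to the TAS.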

In RL, the action space is the set of all actions the AI can choose from at each time-step. In Trackmania, that might include things like “accelerate,” “accelerate + full right,” “accelerate + full left + brake,” etc.
Here again, increasing the size of the action space is not always a good idea, and there are several important things to consider when designing it.

But that's a topic for another time... I'll dive into it in the next post :)

