Gymnasium (the maintained successor to OpenAI Gym) is a diverse collection of reference environments for simulating and training agents. It provides a standard API that can be used to train a variety of standard agent-goal problems under a chosen control policy.
The cart-pole controller is a classic control problem described in Barto et al. (1983). A pole is attached by an un-actuated joint to a cart that moves along a frictionless track. The pole starts upright on the cart, and the goal is to keep it balanced by applying forces to the cart in the left and right directions.
The environment itself defines the problem using an action-space, observation-space pair.
The action-space is given by 2 inputs:
1) Push the cart to the left
2) Push the cart to the right
The observation-space is given by 4 outputs:
1) Cart Position
2) Cart Velocity
3) Pole Angle
4) Pole Angular Velocity
Each of these has specific upper and lower limits: for CartPole-v1 the cart position is bounded to the range [-4.8, 4.8], the pole angle to roughly [-0.418, 0.418] rad (about ±24°), and the two velocities are unbounded.
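As a quick sanity check, both spaces can be inspected directly from the environment. The snippet below is a minimal sketch assuming the Gymnasium API and the CartPole-v1 environment, not code from the original project:

```python
import gymnasium as gym

env = gym.make("CartPole-v1")
print(env.action_space)       # Discrete(2): 0 = push cart left, 1 = push cart right
print(env.observation_space)  # Box(4,): position, velocity, pole angle, pole angular velocity

observation, info = env.reset(seed=0)
action = env.action_space.sample()  # pick a random action
observation, reward, terminated, truncated, info = env.step(action)
env.close()
```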
The goal of this project is to build a reinforcement-learning-based controller for the Gymnasium cart-pole environment that can meet the end objective: keep the pole upright for 500 time steps (i.e., achieve a score of 500).
The score for the environment is the number of time steps for which the pole stays upright; the objective is to maximize this score.
I decided to approach this problem by training a reinforcement learning controller that learns under a specific policy which either rewards or penalizes the agent based on preset conditions.
In the initial iterations, I randomized the controller input to see what scores purely random inputs could achieve. Whenever an episode scored above a certain threshold, I stored that score in a separate list and kept its observation-action sequence in memory. These initial seeds provide presumably good observation-action sequences that can lead to high scores.
The first batch of random iterations gave a median score of 28 time steps. I stored the observation-action sequences for all episodes above my threshold of 20, to be used as training data for the model.
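A sketch of this seeding step is shown below. Names such as `collect_initial_data` and `score_threshold` are my own choices for illustration, and the Gymnasium reset/step API is assumed:

```python
import gymnasium as gym

def collect_initial_data(n_episodes=10_000, score_threshold=20, max_steps=500):
    """Run random episodes and keep observation-action pairs from high-scoring ones."""
    env = gym.make("CartPole-v1")
    training_data, accepted_scores = [], []

    for _ in range(n_episodes):
        observation, _ = env.reset()
        episode_memory, score = [], 0.0

        for _ in range(max_steps):
            action = env.action_space.sample()            # fully random controller input
            episode_memory.append((observation, action))
            observation, reward, terminated, truncated, _ = env.step(action)
            score += reward
            if terminated or truncated:
                break

        if score >= score_threshold:                       # keep only the "good" seeds
            accepted_scores.append(score)
            # one-hot encode the action to match the 2-output network used later
            training_data += [(obs, [1, 0] if act == 0 else [0, 1])
                              for obs, act in episode_memory]

    env.close()
    return training_data, accepted_scores
```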
Then I used the TFLearn library (a high-level API built on top of TensorFlow) to make a deep neural network that could learn the mapping from observations to actions. It was modeled as a network of fully connected layers: the input layer takes the 4 observations, the output layer produces the 2 actions, and in between there are 5 hidden layers of 128, 256, 512, 256, and 128 neurons respectively.
ReLU was the activation function for all hidden layers, while the output layer used a softmax activation. Categorical cross-entropy was used as the loss function.
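A sketch of that network in TFLearn is shown below; the layer sizes, activations, and loss follow the description above, while the optimizer, learning rate, and helper names are assumptions on my part:

```python
import numpy as np
import tflearn
from tflearn.layers.core import input_data, fully_connected
from tflearn.layers.estimator import regression

def build_model():
    net = input_data(shape=[None, 4], name='input')          # 4 observations in
    for width in (128, 256, 512, 256, 128):                  # 5 hidden ReLU layers
        net = fully_connected(net, width, activation='relu')
    net = fully_connected(net, 2, activation='softmax')      # 2 actions out
    net = regression(net, optimizer='adam', learning_rate=1e-3,
                     loss='categorical_crossentropy', name='targets')
    return tflearn.DNN(net)

def train_model(training_data, model=None, n_epoch=5):
    """training_data holds (observation, one-hot action) pairs from the seed episodes."""
    X = np.array([obs for obs, _ in training_data]).reshape(-1, 4)
    y = np.array([act for _, act in training_data])
    model = model or build_model()
    model.fit({'input': X}, {'targets': y}, n_epoch=n_epoch, show_metric=True)
    return model
```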
Once training was complete, the model was used to predict which action the agent should take at each step. The agent performed much better than in the previous iteration, getting scores above 200 consistently.
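A sketch of this evaluation loop, assuming `model` is the trained TFLearn model from the previous step:

```python
import numpy as np
import gymnasium as gym

def evaluate(model, n_episodes=100, max_steps=500):
    env = gym.make("CartPole-v1")
    scores = []
    for _ in range(n_episodes):
        observation, _ = env.reset()
        score = 0.0
        for _ in range(max_steps):
            # pick the action with the highest predicted probability
            action = int(np.argmax(model.predict(observation.reshape(1, 4))[0]))
            observation, reward, terminated, truncated, _ = env.step(action)
            score += reward
            if terminated or truncated:
                break
        scores.append(score)
    env.close()
    print("mean score:", np.mean(scores))
    return scores
```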
However, this was still short of the 500 mark expected for the agent. By tuning the threshold values and regenerating seed episodes with the newly trained model, I built new training data and retrained the agent on it, which achieved much better results.
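One way to implement this regenerate-and-retrain step is sketched below, reusing the hypothetical helpers above; the raised threshold of 100 is a placeholder, since the original write-up does not specify the tuned values:

```python
import numpy as np
import gymnasium as gym

def regenerate_seeds(model, n_episodes=1000, score_threshold=100, max_steps=500):
    """Collect new training data by letting the current model, not random input, drive the cart."""
    env = gym.make("CartPole-v1")
    training_data = []
    for _ in range(n_episodes):
        observation, _ = env.reset()
        episode_memory, score = [], 0.0
        for _ in range(max_steps):
            action = int(np.argmax(model.predict(observation.reshape(1, 4))[0]))
            episode_memory.append((observation, [1, 0] if action == 0 else [0, 1]))
            observation, reward, terminated, truncated, _ = env.step(action)
            score += reward
            if terminated or truncated:
                break
        if score >= score_threshold:      # keep only episodes above the raised threshold
            training_data += episode_memory
    env.close()
    return training_data

# model = train_model(regenerate_seeds(model, score_threshold=100), model=model)
```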