- Sutton and Barto's book "Reinforcement Learning: An Introduction"
- Project 7 in the Georgia Tech Spring 2020 course Machine Learning for Trading by Prof. Tucker Balch.
- The code has been written and tested in Python 3.7.7.
- Q-learning implementation for reinforcement learning.
- Options: basic Q-learning, Dyna-Q (for model planning), double Q-learning (to avoid maximization bias).
- Dyna-Q has been implemented with both a deterministic model and a probabilistic model.
- The deterministic and probabilistic models each come in two versions: one using dictionaries (less memory but slower) and one using arrays (more memory but faster).
- Double Q-learning can be used with basic Q-learning as well as with Dyna-Q.
- The Q-learning class in QLearner.py can be used for any reinforcement learning problem, while robot.py and test.py are specific to a grid-world type problem (i.e. finding the best policy to go from a start point to a goal point).
- Note: states must be unique integers in a given interval, actions must be unique integers in the interval (0, num_actions), and all states must have all the actions.
- Usage: python test.py csv-filename.
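As a rough illustration of how the basic and double Q-learning options listed above differ, here is a minimal sketch of the two tabular updates. It is not taken from QLearner.py: the function names, the table sizes, and the use of NumPy arrays indexed by integer state and action are all assumptions made for this example, and terminal states are not treated specially.

```python
import random

import numpy as np


def q_update(Q, s, a, r, s_next, alpha, gamma):
    """Basic Q-learning: bootstrap from the max of the same table."""
    target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (target - Q[s, a])


def double_q_update(Q1, Q2, s, a, r, s_next, alpha, gamma, rng):
    """Double Q-learning: one table picks the action, the other scores it."""
    if rng.random() < 0.5:
        Q1, Q2 = Q2, Q1  # swap roles; the arrays are still updated in place
    best = int(np.argmax(Q1[s_next]))       # action chosen by the first table
    target = r + gamma * Q2[s_next, best]   # value taken from the second table
    Q1[s, a] += alpha * (target - Q1[s, a])


# Hypothetical sizes: every state has every action, so a dense table works.
num_states, num_actions = 4, 2

Q = np.zeros((num_states, num_actions))
q_update(Q, s=0, a=1, r=-1.0, s_next=1, alpha=0.5, gamma=0.9)

QA = np.zeros((num_states, num_actions))
QB = np.zeros((num_states, num_actions))
double_q_update(QA, QB, 0, 1, -1.0, 1, 0.5, 0.9, random.Random(0))
```

Because each double Q-learning step updates only one of the two tables, the value used in the target is estimated independently of the action selection, which is what reduces maximization bias.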
Name of the file containing the map layout, passed as an argument. It must be a CSV file, with the map elements specified as integers.
List of elements allowed in the map layout.
List of rewards associated with each element in map_elements.
List of allowed moves for the robot (see also an example of an 8-way robot in test.py).
Number of episodes (each episode is a trip from start to goal).
Maximum number of steps allowed to reach the goal (for each episode).
Probability that the robot will move randomly instead of moving as required.
Learning rate (used to vary the weight given to new experiences compared with past Q-values).
Discount factor (used to progressively reduce the value of future rewards).
Probability of selecting a random action instead of using the action derived from the Q-table(s) (i.e. probability to explore).
Decay rate for the probability to explore (used to reduce the probability to explore over time).
Number of simulated updates in Dyna-Q (when equal to zero Dyna-Q is not used).
Type of model used for the simulation in Dyna-Q (1-2 are deterministic models, 3-4 are probabilistic models).
Specifies if double Q-learning is used (to avoid maximization bias).
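To make the roles of the learning rate, discount factor, exploration probability, decay rate, and number of Dyna-Q updates more concrete, the following is a minimal sketch of one real step followed by simulated updates from a deterministic dictionary model. It is an illustration only and may not match QLearner.py: the names `choose_action` and `dyna_planning`, the multiplicative decay of `epsilon`, and the `(state, action) -> (next_state, reward)` model layout are assumptions.

```python
import random

import numpy as np


def choose_action(Q, s, epsilon, rng):
    """Explore with probability epsilon, otherwise act greedily."""
    if rng.random() < epsilon:
        return rng.randrange(Q.shape[1])
    return int(np.argmax(Q[s]))


def dyna_planning(Q, model, n_updates, alpha, gamma, rng):
    """Replay n_updates remembered transitions from a deterministic model."""
    seen = list(model)  # (state, action) pairs experienced so far
    if not seen:
        return
    for _ in range(n_updates):
        s, a = rng.choice(seen)
        s_next, r = model[(s, a)]
        target = r + gamma * np.max(Q[s_next])
        Q[s, a] += alpha * (target - Q[s, a])


rng = random.Random(0)
Q = np.zeros((3, 2))         # hypothetical: 3 states, 2 actions
model = {}                   # (state, action) -> (next_state, reward)
alpha, gamma = 0.2, 0.9      # learning rate, discount factor
epsilon, decay = 0.5, 0.99   # exploration probability and its decay rate

# One hypothetical real step: from state 0 to state 1 with reward -1.
s = 0
a = choose_action(Q, s, epsilon, rng)
s_next, r = 1, -1.0
Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])
model[(s, a)] = (s_next, r)  # remember the transition for planning

# 200 simulated updates; with n_updates=0 this reduces to plain Q-learning.
dyna_planning(Q, model, n_updates=200, alpha=alpha, gamma=gamma, rng=rng)

epsilon *= decay             # explore a little less on the next step
```

A probabilistic model would instead keep counts of the observed next states for each state-action pair and sample from them during planning.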
All examples are for the map layout in map.csv. All initial data are as in test.py unless otherwise specified.
- Basic Q-learning, episodes = 1000, dyna = 0
REWARDS: mean = -63.1, median = -32.0, std = 109.8
STEPS: mean = 62.1, median = 34.0, std = 96.3
Number of updates done: 62085
# # # # # # # # # # # # # # #
# #
# S ~ ~ #
# . # # # # #
# . . . . # G #
# . . # . #
# . # # # # # # . . #
# . # . #
# . . . . . # . . #
# # # . . . #
# # # # #
# # # #
# # # # # # # # # # # # # # #
BEST PATH: rewards = -22.0, Steps = 24.0
- Double Q-learning, episodes = 1000, dyna = 0
REWARDS: mean = -85.0, median = -40.0, std = 132.7
STEPS: mean = 85.5, median = 42.0, std = 130.5
Number of updates done: 85473
# # # # # # # # # # # # # # #
# #
# S ~ ~ #
# . # # # # #
# . # G #
# . # . #
# . # # # # # # . #
# . # . . . #
# . . . . . . . # . #
# # # . . . . . #
# # # # #
# # # #
# # # # # # # # # # # # # # #
BEST PATH: rewards = -22.0, Steps = 24.0
- Double Q-learning, episodes = 50, dyna = 200, model_type = 1
REWARDS: mean = -70.7, median = -28.0, std = 158.5
STEPS: mean = 52.9, median = 30.0, std = 93.5
Number of updates done: 531243
# # # # # # # # # # # # # # #
# #
# S . . . . ~ ~ #
# . # # # # #
# . # G #
# . # . #
# . # # # # # # . #
# . # . . #
# . . . # . . #
# # # . . . . . #
# # # # #
# # # #
# # # # # # # # # # # # # # #
BEST PATH: rewards = -22.0, Steps = 24.0
- Basic Q-learning, episodes = 50, dyna = 200, model_type = 4
REWARDS: mean = -92.7, median = -42.5, std = 183.9
STEPS: mean = 76.9, median = 44.5, std = 94.5
Number of updates done: 567340
Number of updates skipped: 205103
# # # # # # # # # # # # # # #
# #
# S ~ ~ #
# . # # # # #
# . # . G #
# . . # . #
# . # # # # # # . #
# . . . . . . . # . . #
# . # . #
# # # . . . . #
# # # # #
# # # #
# # # # # # # # # # # # # # #
BEST PATH: rewards = -22.0, Steps = 24.0
- Basic Q-learning, episodes = 1000, dyna = 0, using an 8-way robot
REWARDS: mean = -66.6, median = -25.0, std = 120.9
STEPS: mean = 63.3, median = 27.0, std = 100.1
Number of updates done: 63261
# # # # # # # # # # # # # # #
# #
# S ~ ~ #
# . # # # # #
# . # G #
# . # . #
# . # # # # # # . . #
# . # . #
# . . # . #
# # # . . #
# # # # #
# # # #
# # # # # # # # # # # # # # #
BEST PATH: rewards = -13.0, Steps = 15.0