- Introduction
- Environment
- Action Space
- State Space
- Reward Function
- Deep Q-Learning Network (DQN) Algorithm
- Usage
- Results
This project explores the application of Deep Reinforcement Learning in the context of autonomous driving. The agents are trained using the DQN algorithm on different driving environments (merge-v0 and highway-fast-v0). The project includes training with state representations and CNN-based observations, along with transfer learning techniques to enhance agent performance.
We utilize the HighwayEnv environment for this project. HighwayEnv is a highly configurable simulator for highway driving scenarios, providing realistic settings for testing various autonomous driving tasks. It supports multiple driving scenarios and offers a flexible API for integration with reinforcement learning frameworks.
The environments used for training and evaluating the agents are:
- merge-v0: Simulates a highway merging scenario.
- highway-fast-v0: Simulates high-speed highway driving.
The action space defines the set of possible actions that the agent (autonomous vehicle) can take. In HighwayEnv, the action space consists of 5 discrete meta-actions:
- 0: Lane-Left: Move the vehicle one lane to the left.
- 1: IDLE: Maintain the current lane and speed.
- 2: Lane-Right: Move the vehicle one lane to the right.
- 3: Faster: Increase the vehicle's speed.
- 4: Slower: Decrease the vehicle's speed.
The DiscreteMetaAction type adds a layer of speed and steering controllers on top of the continuous low-level control, so that the ego-vehicle can automatically follow the road at a desired velocity. The available meta-actions then consist of changing the target lane and speed that are used as setpoints for the low-level controllers.
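The index-to-action mapping listed above can be written as a small lookup table. This is only a sketch: the `ACTIONS` dictionary and the `describe_action` helper are illustrative names, not part of HighwayEnv.

```python
# Illustrative lookup table for the five discrete meta-actions listed above.
ACTIONS = {
    0: "LANE_LEFT",
    1: "IDLE",
    2: "LANE_RIGHT",
    3: "FASTER",
    4: "SLOWER",
}


def describe_action(index: int) -> str:
    """Return the meta-action name for an action index (hypothetical helper)."""
    if index not in ACTIONS:
        raise ValueError(f"unknown action index: {index}")
    return ACTIONS[index]


print(describe_action(3))  # FASTER
```

A table like this is mainly useful for logging which meta-action the agent picked at each step.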
The state space represents the current situation of the environment, which the agent uses to make decisions. In HighwayEnv, each state is a $V \times F$ array that describes $V$ nearby vehicles by a set of $F$ features, for example:
| Presence | Vehicle | x | y | vx | vy |
|---|---|---|---|---|---|
| 1 | ego-vehicle | 0.05 | 0.04 | 0.75 | 0 |
| 1 | vehicle 1 | -0.1 | 0.04 | 0.6 | 0 |
| 1 | vehicle 2 | 0.13 | 0.08 | 0.675 | 0 |
| ... | ... | ... | ... | ... | ... |
| 1 | vehicle V | 0.222 | 0.105 | 0.9 | 0.025 |
Rows: Each row represents a vehicle, with the first row always representing the ego vehicle.
Columns: Each column is a feature that is described below:
| Feature | Description |
|---|---|
| presence | Disambiguates agents at 0 offset from non-existent agents. |
| Vehicle | Indicates the vehicle's name. |
| $x$ | World offset of the ego vehicle, or offset to the ego vehicle, on the $x$ axis. |
| $y$ | World offset of the ego vehicle, or offset to the ego vehicle, on the $y$ axis. |
| $v_x$ | Velocity of the vehicle on the $x$ axis. |
| $v_y$ | Velocity of the vehicle on the $y$ axis. |
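A fully connected network expects a flat vector rather than a matrix, so the observation is typically flattened before it is fed to the model. The sketch below uses the example rows from the table above; whether the notebook flattens its observations in exactly this way is an assumption, and `flatten` is an illustrative helper.

```python
# Example observation from the table above: one row per vehicle,
# columns are (presence, x, y, vx, vy). The "Vehicle" column is only a label.
observation = [
    [1, 0.05, 0.04, 0.75, 0.0],   # ego-vehicle (always the first row)
    [1, -0.1, 0.04, 0.6, 0.0],    # vehicle 1
    [1, 0.13, 0.08, 0.675, 0.0],  # vehicle 2
]


def flatten(obs):
    """Flatten a V x F observation into a single list of length V * F."""
    return [value for row in obs for value in row]


state_vector = flatten(observation)
print(len(observation), len(observation[0]), len(state_vector))  # 3 5 15
```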
In HighwayEnv, the reward function balances speed optimization and collision avoidance:
- Speed Reward: Encourages the agent to drive at higher speeds, scaled between the minimum $v_{\min}$ and maximum $v_{\max}$ speeds.
- Collision Penalty: Penalizes the agent for collisions with other vehicles, promoting safer driving behavior.
- $a$ and $b$: Coefficients that adjust the influence of speed optimization and collision avoidance in the overall reward.
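HighwayEnv's documentation writes this reward as $R(s,a) = a\,\frac{v - v_{\min}}{v_{\max} - v_{\min}} - b\,\text{collision}$, where $v$ is the ego-vehicle's speed. The function below is a minimal sketch of that formula. The default values for `a`, `b`, `v_min` and `v_max` are placeholders chosen for illustration, not the values configured in this project, and clipping the speed to the valid range is a choice made for the sketch.

```python
def reward(speed, crashed, a=0.4, b=1.0, v_min=20.0, v_max=30.0):
    """Speed term scaled to [0, 1] and weighted by a, minus a collision penalty b.

    All default values are illustrative placeholders.
    """
    clipped = min(max(speed, v_min), v_max)          # keep the speed inside [v_min, v_max]
    speed_term = (clipped - v_min) / (v_max - v_min)
    return a * speed_term - b * float(crashed)


print(reward(25.0, crashed=False))  # 0.2
print(reward(20.0, crashed=True))   # -1.0
```

With these placeholder weights, a single collision costs more than the agent can gain from driving at full speed for one step, which is the trade-off the two terms are meant to express.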
Deep Q-learning is a value-based reinforcement learning algorithm where a neural network is used to approximate the Q-value function, which predicts the expected cumulative reward for taking an action in a given state. The notebook Decision-Making Using Deep Reinforcement Learning in Autonomous Driving Tasks.ipynb includes implementations of Q-learning with both linear and convolutional neural networks (CNNs), experience replay, and various training and evaluation procedures.
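The network is trained towards the one-step target $y = r + \gamma \max_{a'} Q_{\text{target}}(s', a')\,(1 - \text{done})$, which is the quantity computed in the agent's learn method. The sketch below evaluates this target for a single transition in plain Python; `td_target` is an illustrative helper, and the Q-values passed to it are made-up numbers.

```python
GAMMA = 0.99  # discount factor used in this project


def td_target(reward, next_q_values, done, gamma=GAMMA):
    """One-step Q-learning target for a single transition."""
    # A terminal transition has no future reward to bootstrap from.
    bootstrap = 0.0 if done else max(next_q_values)
    return reward + gamma * bootstrap


print(td_target(1.0, [0.5, 2.0, 1.5], done=False))  # reward plus discounted best next value
print(td_target(1.0, [0.5, 2.0, 1.5], done=True))   # 1.0
```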
The following hyperparameters are used in our DQN implementation:
| Parameter | Value |
|---|---|
| BUFFER_SIZE | 10000 |
| GAMMA | 0.99 |
| Learning rate (LR) | 0.0005 |
| α (Q-learning parameter) | 0.001 |
| Epsilon start | 1 |
| Epsilon end | 0.001 |
| Epsilon decay | 0.995 |
| Number of iterations (runs) | 5 |
| Number of episodes | 3600 |
| Max step | 10000 |
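To see what the epsilon values in the table imply, the sketch below replays the multiplicative decay with a floor, applied once per episode as in the training loop. `epsilon_schedule` is an illustrative helper, not a function from the notebook.

```python
EPS_START, EPS_END, EPS_DECAY = 1.0, 0.001, 0.995  # values from the table above


def epsilon_schedule(n_episodes):
    """Yield the exploration rate used in each episode."""
    eps = EPS_START
    for _ in range(n_episodes):
        yield eps
        eps = max(EPS_END, EPS_DECAY * eps)  # decay, but never below the floor


schedule = list(epsilon_schedule(3600))
# Index of the first episode that runs at the minimum exploration rate.
first_floor = next(i for i, e in enumerate(schedule) if e == EPS_END)
print(first_floor, schedule[-1])
```

Under these settings the floor is reached after roughly 1,380 episodes, so the remaining episodes of a 3600-episode run are almost fully greedy.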
- QNetwork_Linear: Defines the linear neural network architecture.
- QNetwork_CNN: Defines the convolutional neural network architecture.
- ReplayBuffer: Implements the experience replay buffer.
- Agent: Encapsulates the DQN agent's behavior.
- DQN: Handles training and evaluation of the agent.
- train_with_state: Trains the agent using state-based observations.
- train_with_observation: Trains the agent using image-based observations.
- evaluation: Evaluates the trained agent.
The linear neural network consists of three fully connected layers:
import torch
import torch.nn as nn
import torch.nn.functional as F

class QNetwork_Linear(nn.Module):
    """Fully connected Q-network: state vector in, one Q-value per action out."""

    def __init__(self, state_size, action_size, seed):
        super(QNetwork_Linear, self).__init__()
        self.seed = torch.manual_seed(seed)
        self.fc1 = nn.Linear(state_size, 125)
        self.fc2 = nn.Linear(125, 125)
        self.fc3 = nn.Linear(125, action_size)

    def forward(self, state):
        x = self.fc1(state)
        x = F.relu(x)
        x = self.fc2(x)
        x = F.relu(x)
        x = self.fc3(x)
        return x
The CNN architecture includes convolutional, max-pooling, and dropout layers followed by fully connected layers:
class QNetwork_CNN(nn.Module):
    """Convolutional Q-network: RGB image in, one Q-value per action out."""

    def __init__(self, action_size, seed):
        super(QNetwork_CNN, self).__init__()
        self.seed = torch.manual_seed(seed)
        self.conv1 = nn.Conv2d(3, 128, kernel_size=3)
        self.relu = nn.ReLU()
        self.maxpool = nn.MaxPool2d(kernel_size=2)
        self.dropout = nn.Dropout(0.1)
        self.conv2 = nn.Conv2d(128, 128, kernel_size=3)
        self.flatten = nn.Flatten()
        # 38912 is the flattened size of the second pooling output; it depends on the input image size.
        self.fc1 = nn.Linear(38912, 64)
        self.fc2 = nn.Linear(64, action_size)

    def forward(self, state):
        x = self.dropout(self.maxpool(self.relu(self.conv1(state))))
        x = self.dropout(self.maxpool(self.relu(self.conv2(x))))
        x = self.flatten(x)
        x = self.fc1(x)
        x = self.relu(x)
        x = self.fc2(x)
        return x
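The input size of the first fully connected layer (38912) follows from the image size and the two convolution and pooling stages. The arithmetic below is a sketch under one assumption: that the network receives a 3 × 40 × 160 image, which is what a 600 × 150 frame becomes when its shorter side is resized to 40 pixels. The helper names are illustrative.

```python
def conv_out(size, kernel=3):
    """Output length of an unpadded, stride-1 convolution."""
    return size - kernel + 1


def pool_out(size, kernel=2):
    """Output length of max pooling with stride equal to the kernel size."""
    return size // kernel


def flat_features(height, width, channels=128):
    """Number of features entering fc1 after two conv + pool stages."""
    for _ in range(2):
        height, width = pool_out(conv_out(height)), pool_out(conv_out(width))
    return channels * height * width


print(flat_features(40, 160))  # 38912
```

If the rendered frame or the resize target changes, the same calculation gives the value that the first fully connected layer needs.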
The ReplayBuffer class stores experiences and samples them for training:
import random
from collections import deque, namedtuple

import numpy as np
import torch

# Assumption: the device is selected once for the whole notebook.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

class ReplayBuffer():
    """Fixed-size buffer of experience tuples with uniform random sampling."""

    def __init__(self, action_size, buffer_size, batch_size, seed):
        self.action_size = action_size
        self.memory = deque(maxlen=buffer_size)  # oldest experiences are dropped first
        self.batch_size = batch_size
        self.experience = namedtuple("Experience", field_names=["state", "action", "reward", "next_state", "done"])
        self.seed = random.seed(seed)

    def add(self, state, action, reward, next_state, done):
        e = self.experience(state, action, reward, next_state, done)
        self.memory.append(e)  # store the new experience

    def sample(self):
        experiences = random.sample(self.memory, k=self.batch_size)
        states = torch.from_numpy(np.vstack([e.state for e in experiences if e is not None])).float().to(device)
        actions = torch.from_numpy(np.vstack([e.action for e in experiences if e is not None])).long().to(device)
        rewards = torch.from_numpy(np.vstack([e.reward for e in experiences if e is not None])).float().to(device)
        next_states = torch.from_numpy(np.vstack([e.next_state for e in experiences if e is not None])).float().to(device)
        dones = torch.from_numpy(np.vstack([e.done for e in experiences if e is not None]).astype(np.uint8)).float().to(device)
        return (states, actions, rewards, next_states, dones)

    def __len__(self):
        return len(self.memory)
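The buffer relies on two standard-library behaviours: a `deque` with `maxlen` silently discards its oldest entries once full, and `random.sample` draws a batch uniformly without replacement. The standalone sketch below shows both with a deliberately tiny capacity; the integer states are dummy values.

```python
import random
from collections import deque, namedtuple

Experience = namedtuple("Experience", ["state", "action", "reward", "next_state", "done"])

buffer = deque(maxlen=3)  # tiny capacity so that eviction is easy to see
for step in range(5):
    buffer.append(Experience(state=step, action=0, reward=1.0, next_state=step + 1, done=False))

# Only the three most recent experiences survive.
print([e.state for e in buffer])  # [2, 3, 4]

random.seed(0)
batch = random.sample(list(buffer), k=2)  # uniform sampling without replacement
print(len(batch))  # 2
```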
The Agent class implements the DQN algorithm, managing the Q-networks, the optimizer, and the interaction with the replay buffer:
import torch.optim as optim
import torchvision.transforms as T
from PIL import Image

# LR, BUFFER_SIZE, BATCH_SIZE, UPDATE_EVERY, GAMMA and 𝛼 are module-level hyperparameters.
class Agent():
    def __init__(self, state_size, action_size, network_type, seed):
        self.state_size = state_size
        self.action_size = action_size
        self.seed = random.seed(seed)
        self.network_type = network_type
        # The local network is trained; the target network provides stable bootstrap values.
        if network_type in ['linear', 'Linear']:
            self.qnetwork_local = QNetwork_Linear(state_size, action_size, seed).to(device)
            self.qnetwork_target = QNetwork_Linear(state_size, action_size, seed).to(device)
        elif network_type in ['cnn', 'CNN']:
            self.qnetwork_local = QNetwork_CNN(action_size, seed).to(device)
            self.qnetwork_target = QNetwork_CNN(action_size, seed).to(device)
        self.optimizer = optim.Adam(self.qnetwork_local.parameters(), lr=LR)
        self.memory = ReplayBuffer(action_size, BUFFER_SIZE, BATCH_SIZE, seed)
        self.t_step = 0
        # Image.BICUBIC replaces the Image.CUBIC alias, which recent Pillow releases no longer provide.
        self.resize = T.Compose([T.ToPILImage(), T.Resize(40, interpolation=Image.BICUBIC), T.ToTensor()])

    def get_screen(self, screen):
        # Convert an H x W x C frame into a resized 1 x C x H x W float tensor.
        screen = np.ascontiguousarray(screen, dtype=np.float32) / 255
        screen = torch.from_numpy(screen)
        screen = screen.permute(2, 0, 1)
        screen_resized = self.resize(screen).unsqueeze(0)
        return screen_resized

    def step(self, state, action, reward, next_state, done):
        self.memory.add(state, action, reward, next_state, done)
        # Learn every UPDATE_EVERY steps, once enough samples are available.
        self.t_step = (self.t_step + 1) % UPDATE_EVERY
        if self.t_step == 0:
            if len(self.memory) > BATCH_SIZE:
                experiences = self.memory.sample()
                self.learn(experiences, GAMMA)

    def act(self, state, eps=0):
        if self.network_type in ['linear', 'Linear']:
            state = torch.from_numpy(state).float().unsqueeze(0).to(device)
        elif self.network_type in ['cnn', 'CNN']:
            state = state.to(device)
        with torch.no_grad():
            action_values = self.qnetwork_local(state)
        # Epsilon-greedy action selection.
        if random.random() > eps:
            return np.argmax(action_values.cpu().data.numpy())
        return random.choice(np.arange(self.action_size))

    def learn(self, experiences, gamma):
        states, actions, rewards, next_states, dones = experiences
        q_targets_next = self.qnetwork_target(next_states).detach().max(1)[0].unsqueeze(1)
        q_targets = rewards + gamma * q_targets_next * (1 - dones)
        q_expected = self.qnetwork_local(states).gather(1, actions)
        loss = F.mse_loss(q_expected, q_targets)
        # Gradient step on the local network.
        self.optimizer.zero_grad()
        loss.backward()
        self.optimizer.step()
        self.soft_update(self.qnetwork_local, self.qnetwork_target, 𝛼)

    def soft_update(self, local_model, target_model, 𝛼):
        # target = 𝛼 * local + (1 - 𝛼) * target
        for target_param, local_param in zip(target_model.parameters(), local_model.parameters()):
            target_param.data.copy_(𝛼*local_param.data + (1-𝛼)*target_param.data)
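The soft update moves the target network only a small step towards the local network each time it runs. The plain-Python sketch below applies the same rule to a short list of numbers, assuming the α = 0.001 entry in the hyperparameter table is the rate passed to soft_update. `ALPHA` and the list-based `soft_update` are illustrative stand-ins for the tensor version.

```python
ALPHA = 0.001  # assumed soft-update rate (the α entry in the hyperparameter table)


def soft_update(local_params, target_params, alpha=ALPHA):
    """Return new target parameters: alpha * local + (1 - alpha) * target."""
    return [alpha * l + (1 - alpha) * t for l, t in zip(local_params, target_params)]


local = [1.0, -2.0]   # made-up "local network" parameters, held fixed here
target = [0.0, 0.0]   # made-up "target network" parameters
for _ in range(1000):
    target = soft_update(local, target)
print(target)
```

After 1000 updates with a fixed local network, the target has covered about 63% of the gap, which shows how slowly the bootstrap values change at this rate.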
The DQN class manages the training and evaluation of the agent, providing methods for both state-based and observation-based training:
import pickle
from collections import deque

import numpy as np
import torch

# The loops below use the classic Gym API: reset() returns the observation and
# step() returns (obs, reward, done, info). Newer releases return (obs, info) and a 5-tuple.
class DQN():
    def __init__(self, env, env_name, model_path, data_path, network_type, env_name_source=None, model_path_source=None, transfer_episode=None):
        self.env = env
        self.env_name = env_name
        self.model_path = model_path
        self.data_path = data_path
        self.network_type = network_type

    def train_with_state(self, n_episodes, max_t, eps_start, eps_end, eps_decay):
        scores = []
        scores_window = deque(maxlen=100)
        eps = eps_start
        # Create the agent once so that learning carries over between episodes.
        state = self.env.reset()
        agent = Agent(state_size=len(state), action_size=self.env.action_space.n, network_type=self.network_type, seed=0)
        for i_episode in range(1, n_episodes+1):
            state = self.env.reset()
            score = 0
            for t in range(max_t):
                action = agent.act(state, eps)
                next_state, reward, done, _ = self.env.step(action)
                agent.step(state, action, reward, next_state, done)
                state = next_state
                score += reward
                if done:
                    break
            scores_window.append(score)
            scores.append(score)
            eps = max(eps_end, eps_decay*eps)  # decay exploration once per episode
            if i_episode % 100 == 0:
                print(f'\rEpisode {i_episode}\tAverage Score: {np.mean(scores_window)}')
        return scores

    def train_with_observation(self, n_episodes, max_t, eps_start, eps_end, eps_decay):
        scores = []
        scores_window = deque(maxlen=100)
        eps = eps_start
        env = self.env
        agent = Agent(state_size=env.observation_space.shape[0], action_size=env.action_space.n, network_type=self.network_type, seed=0)
        for i_episode in range(1, n_episodes+1):
            env.reset()
            # Each observation is the difference between two consecutive rendered frames.
            last_screen = agent.get_screen(env.render(mode='rgb_array'))
            current_screen = agent.get_screen(env.render(mode='rgb_array'))
            state = current_screen - last_screen
            score = 0
            for t in range(max_t):
                action = agent.act(state, eps)
                _, reward, done, _ = env.step(action)
                last_screen = current_screen
                current_screen = agent.get_screen(env.render(mode='rgb_array'))
                if not done:
                    next_state = current_screen - last_screen
                else:
                    # Terminal transition: store a blank frame; learn() masks it out through `done`.
                    next_state = torch.zeros_like(state)
                agent.step(state, action, reward, next_state, done)
                state = next_state
                score += reward
                if done:
                    break
            scores_window.append(score)
            scores.append(score)
            eps = max(eps_end, eps_decay*eps)
            if i_episode % 100 == 0:
                print(f'\rEpisode {i_episode}\tAverage Score: {np.mean(scores_window)}')
        return scores

    def evaluation(self, agent, n_episodes, max_t, filename):
        scores = []
        for i_episode in range(1, n_episodes+1):
            state = self.env.reset()
            score = 0
            for t in range(max_t):
                action = agent.act(state, eps=0)  # greedy policy, no exploration
                state, reward, done, _ = self.env.step(action)
                score += reward
                if done:
                    break
            scores.append(score)
        # Save the per-episode scores for later plotting.
        with open(filename, 'wb') as f:
            pickle.dump(scores, f)
        return scores
Training and Evaluating with state-based observations on merge-v0
# Initialize DQN agent with linear network
env_name = 'merge-v0'
env = gym.make(env_name, render_mode='rgb_array')
dqn_merge = DQN(env, env_name, model_path, data_path, network_type='linear')
# Training
df_reward_dqn_merge = dqn_merge.train_with_state(n_iteration=[1, 2, 3, 4, 5], n_training_episodes=3600, max_step=10000)
# Evaluating
sum_rewards = dqn_merge.evaluation(video_path+"merge.mp4", evaluate_type='state', iter_num=2, evaluate_episode_num=2000, use_saved_model=True)
# Show video
show_and_plot().show_video(directory=video_path, file_name='merge.mp4')
Training and Evaluating with state-based observations on highway-fast-v0
# Initialize DQN agent with linear network
env_name = 'highway-fast-v0'
env = gym.make(env_name, render_mode='rgb_array')
dqn_fastHighway = DQN(env, env_name, model_path, data_path, network_type='linear')
# Training
df_reward_dqn_fastHighway = dqn_fastHighway.train_with_state(n_iteration=[1, 2, 3, 4, 5], n_training_episodes=3600, max_step=10000)
# Evaluating
sum_rewards = dqn_fastHighway.evaluation(video_path+"fastHighway.mp4", evaluate_type='state', iter_num=1, evaluate_episode_num=3400, use_saved_model=True)
# Show video
show_and_plot().show_video(directory=video_path, file_name='fastHighway.mp4')
Training and Evaluating with image-based observations on merge-v0
# Initialize DQN agent with CNN network
env_name = 'merge-v0'
env = gym.make(env_name, render_mode='rgb_array')
dqn_merge_cnn = DQN(env, env_name, model_path, data_path, network_type='CNN')
# Training
df_reward_dqn_merge_cnn = dqn_merge_cnn.train_with_observation(n_iteration=[1, 2, 3, 4, 5], n_training_episodes=3600, max_step=10000)
# Evaluating
sum_rewards = dqn_merge_cnn.evaluation(video_path+"merge_observation_CNN.mp4", evaluate_type='observation', iter_num=2, evaluate_episode_num=2000, use_saved_model=True)
# Show video
show_and_plot().show_video(directory=video_path, file_name="merge_observation_CNN.mp4")
Training and Evaluating with image-based observations on highway-fast-v0
# Initialize DQN agent with CNN network
env_name = 'highway-fast-v0'
env = gym.make(env_name, render_mode='rgb_array')
dqn_fasthighway_cnn = DQN(env, env_name, model_path, data_path, network_type='CNN')
# Training
df_reward_dqn_fasthighway_cnn = dqn_fasthighway_cnn.train_with_observation(n_iteration=[1, 2, 3, 4, 5], n_training_episodes=3600, max_step=10000)
# Evaluating
sum_rewards = dqn_fasthighway_cnn.evaluation(video_path+"fast_observation_CNN.mp4", evaluate_type='observation', iter_num=2, evaluate_episode_num=2000, use_saved_model=True)
# Show video
show_and_plot().show_video(directory=video_path, file_name="fast_observation_CNN.mp4")
Transfer learning from merge-v0 on highway-fast-v0
env_name_source = 'merge-v0'
env_name_destination = 'highway-fast-v0'
env = gym.make(env_name_destination, render_mode='rgb_array')
dqn_fastHighway_transferred = DQN(env, env_name_destination, model_path_destination, data_path, network_type='linear', env_name_source=env_name_source, model_path_source=model_path_source, transfer_episode=3600)
# Training
df_reward_dqn_fastHighway_transferred = dqn_fastHighway_transferred.train_with_state(n_iteration=[1, 2, 3, 4, 5], n_training_episodes=3600, max_step=10000)
# Evaluating
sum_rewards = dqn_fastHighway_transferred.evaluation(video_path+"fastHighway_transferred_from_merge.mp4", evaluate_type='state', iter_num=4, evaluate_episode_num=3600, use_saved_model=True)
# Show video
show_and_plot().show_video(directory=video_path, file_name="fastHighway_transferred_from_merge.mp4")
The Merge-v0 task simulates a vehicle merging into traffic. The average reward obtained during the learning episodes is shown below:
- Average Reward during the learning episodes for the merge-v0 task:
The Highway-Fast task involves high-speed driving on a highway. We compare the performance of the DQN algorithm with random initial weights and with weights transferred from the Merge-v0 task (Transfer Learning).
- Average Reward during the learning episodes for the highway-fast task with and without transfer learning:
In this section, we use convolutional neural networks (CNNs) and image-based observations instead of state matrices. Each observation is derived from the difference between two rendered images.
- Average reward during the learning episodes for the merge-v0 task, comparing the linear network with state input against the CNN with image observations:
In this experiment, the DQN with a CNN and image observations performs better than the DQN with a linear network and state matrices: it earns a higher average reward in all but the initial episodes. This advantage is not guaranteed, however, because performance depends on the quality of the observations, so the observation-based DQN will not always outperform the state-based one. Image observations can match or exceed state-based training only when they capture the information that the state matrix encodes.