Q-Learning is a popular model-free reinforcement learning algorithm that helps an agent learn how to make the best decisions by interacting with its environment. Instead of needing a model of the environment, the agent learns purely from experience by trying different actions and observing their results.
Imagine a system that sees an apple but incorrectly says, “It’s a mango.” The system is told, “Wrong! It’s an apple.” It learns from this mistake. Next time, when shown the apple, it correctly says, “It’s an apple.” This trial-and-error process, guided by feedback, is similar to how Q-Learning works.
The core idea is that the agent builds a Q-table that stores Q-values. Each Q-value estimates how good it is to take a specific action in a given state, in terms of the expected future rewards. Over time, the agent updates this table using the feedback it receives.
Key Components of Q-learning

1. Q-Values or Action-Values
Q-values represent the expected rewards for taking an action in a specific state. These values are updated over time using the Temporal Difference (TD) update rule.
2. Rewards and Episodes
The agent moves through different states by taking actions and receiving rewards. The process continues until the agent reaches a terminal state, which ends the episode.
3. Temporal Difference or TD-Update
The agent updates Q-values using the rule:

Q(S,A) \leftarrow Q(S,A) + \alpha (R + \gamma \max_{A'} Q(S',A') - Q(S,A))

Where S is the current state, A is the action taken in it, S' is the resulting next state, A' ranges over the actions available in S', R is the reward received, \alpha is the learning rate and \gamma is the discount factor that weighs future rewards.

4. ϵ-greedy Policy
The ϵ-greedy policy helps the agent decide which action to take based on the current Q-value estimates: with probability ϵ it chooses a random action (exploration) and with probability 1 − ϵ it chooses the action with the highest Q-value for the current state (exploitation).
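A minimal sketch of these two pieces in NumPy is shown below; the helper names epsilon_greedy_action and td_update are illustrative, not part of any library, and the tiny table at the end exists only to demonstrate a call:

Python
import numpy as np

def epsilon_greedy_action(Q_table, state, n_actions, epsilon):
    # Explore with probability epsilon, otherwise exploit the best-known action
    if np.random.rand() < epsilon:
        return np.random.randint(0, n_actions)
    return int(np.argmax(Q_table[state]))

def td_update(Q_table, state, action, reward, next_state, alpha, gamma):
    # Q(S,A) <- Q(S,A) + alpha * (R + gamma * max_A' Q(S',A') - Q(S,A))
    td_target = reward + gamma * np.max(Q_table[next_state])
    Q_table[state, action] += alpha * (td_target - Q_table[state, action])

# Example call on a small illustrative table
Q_table = np.zeros((3, 2))
a = epsilon_greedy_action(Q_table, state=0, n_actions=2, epsilon=0.1)
td_update(Q_table, state=0, action=a, reward=1.0, next_state=1, alpha=0.5, gamma=0.9)
print(Q_table)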
Q-learning models follow an iterative process where different components work together to train the agent. Here's how it works step-by-step:
1. Start at a State (S)
The environment provides the agent with a starting state, which describes the current situation or condition.
2. Agent Selects an Action (A)
Based on the current state, the agent chooses an action using its policy. This decision is guided by the Q-table, which estimates the potential rewards for different state-action pairs. The agent typically uses an ε-greedy strategy: with probability ε it picks a random action to explore, and with probability 1 − ε it picks the action with the highest Q-value to exploit what it has already learned.
3. Action is Performed and the Environment Responds
The agent performs the selected action. The environment then provides a reward (R) indicating how good the action was and the next state (S') that results from it.
4. Q-table is Updated
The agent updates the Q-table using the new experience, applying the TD-update rule above to the entry Q(S, A).
5. Move to the Next State
With the Q-values updated, the agent moves to the next state (S') and repeats the process until it reaches a terminal state, ending the episode.

Over time, the agent learns the optimal policy, one that consistently yields the highest possible reward in the environment.
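Putting the five steps together, the sketch below runs the whole loop on a made-up 5-state corridor environment; env_step and every number in it are illustrative and unrelated to the gridworld implemented later in this article:

Python
import numpy as np

# Toy environment: a 5-state corridor where action 1 moves right, action 0 moves
# left, and reaching the last state yields reward 1 and ends the episode.
n_states, n_actions = 5, 2
alpha, gamma, epsilon, n_episodes = 0.5, 0.9, 0.2, 500
Q_table = np.zeros((n_states, n_actions))

def env_step(state, action):
    next_state = min(state + 1, n_states - 1) if action == 1 else max(state - 1, 0)
    done = (next_state == n_states - 1)
    reward = 1.0 if done else 0.0
    return next_state, reward, done

for episode in range(n_episodes):
    state = np.random.randint(n_states - 1)                   # 1. start at a non-terminal state
    done = False
    while not done:
        if np.random.rand() < epsilon:                        # 2. select an action (epsilon-greedy)
            action = np.random.randint(n_actions)
        else:
            action = np.argmax(Q_table[state])
        next_state, reward, done = env_step(state, action)    # 3. act and receive feedback
        Q_table[state, action] += alpha * (                   # 4. TD update
            reward + gamma * np.max(Q_table[next_state]) - Q_table[state, action])
        state = next_state                                    # 5. move to the next state

print(Q_table)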
Methods for Determining Q-values

1. Temporal Difference (TD)
Temporal Difference learning computes an error by comparing the current Q-value estimate with the reward just received plus the discounted value of the next state. It provides a way to learn directly from experience, without needing a model of the environment.
2. Bellman’s Equation
Bellman’s Equation is a recursive formula used to calculate the value of a state-action pair and determine the optimal action. It is fundamental in the context of Q-learning and is expressed as:

Q(s, a) = R(s, a) + \gamma \max_{a'} Q(s', a')

Where Q(s, a) is the value of taking action a in state s, R(s, a) is the immediate reward for that choice, \gamma is the discount factor that weighs future rewards and s' is the state reached after taking action a; the maximum is taken over the actions a' available in s'.
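As a small numeric illustration of this backup (all numbers are assumed for the example): with an immediate reward of 1, γ = 0.9 and next-state Q-values of [0.5, 2.0, 1.0], the target value works out to 1 + 0.9 × 2.0 = 2.8:

Python
import numpy as np

# Illustrative numbers only
R = 1.0                             # immediate reward R(s, a)
gamma = 0.9                         # discount factor
next_q = np.array([0.5, 2.0, 1.0])  # current estimates of Q(s', a') for each action a'

bellman_target = R + gamma * np.max(next_q)
print(bellman_target)               # 2.8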
The Q-table is essentially a memory structure where the agent stores information about which actions yield the best rewards in each state. It is a table of Q-values representing the agent's understanding of the environment. As the agent explores and learns from its interactions with the environment, it updates the Q-table. The Q-table helps the agent make informed decisions by showing which actions are likely to lead to better rewards.
Structure of a Q-table: rows correspond to the possible states, columns correspond to the possible actions and each cell stores the Q-value Q(s, a), the current estimate of the reward for taking that action in that state.
Over time, as the agent learns and refines its Q-values through exploration and exploitation, the Q-table evolves to reflect the best actions for each state, leading to optimal decision-making.
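As a tiny illustration of this structure (the 4-state, 2-action setup and the numbers are made up for the example):

Python
import numpy as np

n_states, n_actions = 4, 2
Q_table = np.zeros((n_states, n_actions))  # rows = states, columns = actions

Q_table[2, 1] = 0.75   # suppose learning has raised the value of action 1 in state 2

state = 2
best_action = np.argmax(Q_table[state])    # the action the table currently recommends
print(best_action)                         # prints 1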
Implementation of Q-Learning
Here, we implement a basic Q-learning algorithm where the agent learns the optimal action-selection strategy to reach a goal state in a grid-like environment.
Step 1: Define the Environment
Set up the environment parameters, including the number of states and actions, and initialize the Q-table. Here, each state represents a position and actions move the agent within this environment.
Python
import numpy as np

n_states = 16     # total number of states in the environment
n_actions = 4     # number of actions available in each state
goal_state = 15   # terminal state the agent tries to reach

Q_table = np.zeros((n_states, n_actions))  # Q-values start at zero
Step 2: Set Hyperparameters
Define the parameters for the Q-learning algorithm, which include the learning rate, discount factor, exploration probability and the number of training epochs.
Python
learning_rate = 0.8       # alpha: step size for each Q-value update
discount_factor = 0.95    # gamma: weight given to future rewards
exploration_prob = 0.2    # epsilon: probability of choosing a random action
epochs = 1000             # number of training episodes
Step 3: Implement the Q-Learning Algorithm
Perform the Q-learning algorithm over multiple epochs. Each epoch involves selecting actions with an epsilon-greedy strategy, updating Q-values based on the rewards received and transitioning to the next state.
Python
for epoch in range(epochs):
    current_state = np.random.randint(0, n_states)  # start each episode in a random state

    while current_state != goal_state:
        # Epsilon-greedy action selection
        if np.random.rand() < exploration_prob:
            action = np.random.randint(0, n_actions)    # explore
        else:
            action = np.argmax(Q_table[current_state])  # exploit

        # Simplified deterministic transition: the agent always advances one state
        next_state = (current_state + 1) % n_states
        reward = 1 if next_state == goal_state else 0

        # TD update: Q(S,A) <- Q(S,A) + alpha * (R + gamma * max_a Q(S',a) - Q(S,A))
        Q_table[current_state, action] += learning_rate * \
            (reward + discount_factor * np.max(Q_table[next_state]) - Q_table[current_state, action])

        current_state = next_state
Step 4: Output the Learned Q-Table
After training, plot and print the Q-table to examine the learned Q-values, which represent the expected rewards for taking specific actions in each state.
Python
import matplotlib.pyplot as plt

# Best Q-value for each state, reshaped into the 4x4 grid layout
q_values_grid = np.max(Q_table, axis=1).reshape((4, 4))

# Plot the grid of Q-values
plt.figure(figsize=(6, 6))
plt.imshow(q_values_grid, cmap='coolwarm', interpolation='nearest')
plt.colorbar(label='Q-value')
plt.title('Learned Q-values for each state')
plt.xticks(np.arange(4), ['0', '1', '2', '3'])
plt.yticks(np.arange(4), ['0', '1', '2', '3'])
plt.gca().invert_yaxis()  # To match grid layout
plt.grid(True)

# Annotate the Q-values on the grid
for i in range(4):
    for j in range(4):
        plt.text(j, i, f'{q_values_grid[i, j]:.2f}', ha='center', va='center', color='black')

plt.show()

# Print learned Q-table
print("Learned Q-table:")
print(Q_table)
Output:
[Output figure: heatmap of the learned Q-values on the 4x4 grid]

The learned Q-table shows the expected rewards for each state-action pair, with higher Q-values near the goal state (state 15), indicating the actions that lead toward it. The agent's behaviour gradually improves over training, as reflected in the increasing Q-values across the states leading to the goal.
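Continuing the script above, one simple way to read the result is to extract the greedy action for every state from the learned table:

Python
# Greedy policy: for each state, the action with the highest learned Q-value
greedy_policy = np.argmax(Q_table, axis=1)
print("Greedy action per state:", greedy_policy)

(In this simplified environment the transition ignores the chosen action, so the greedy actions mainly reflect which entries happened to be updated most; the rising Q-values toward the goal are the meaningful signal.)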
Advantages of Q-learning
Q-learning is model-free, so the agent needs no model of the environment's dynamics and learns directly from experience. It is off-policy, meaning it can learn about the optimal policy even while following an exploratory one. It is simple to implement with a plain table of values and, given sufficient exploration and an appropriate learning rate, its Q-values converge toward the optimal action-values.