Showing content from https://www.geeksforgeeks.org/machine-learning/actor-critic-algorithm-in-reinforcement-learning/ below:

Actor-Critic Algorithm in Reinforcement Learning

Last Updated : 23 Jul, 2025

The Actor-Critic algorithm is a reinforcement learning method that combines two components: the Actor, which selects actions, and the Critic, which evaluates them. The actor learns how to make decisions, and the critic scores how good those decisions are, so the actor's updates are guided by the critic's feedback. This division of roles helps the agent balance exploring new actions with exploiting what it has already learned.

Key Terms in Actor Critic Algorithm

There are two key terms:

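1. Policy (Actor): The policy \pi_\theta(a|s) gives the probability of taking action a in state s. The actor adjusts the parameters \theta to favor actions that lead to higher return.

2. Value Function (Critic): The value function V_w(s) estimates the expected cumulative reward from state s under the current policy. The critic adjusts the parameters w to make this estimate more accurate.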

How Does the Actor-Critic Algorithm Work?

Actor-Critic Algorithm Objective Function

1. Policy Gradient (Actor)

\nabla_\theta J(\theta)\approx \frac{1}{N} \sum_{i=1}^{N} \nabla_\theta \log\pi_\theta (a_i|s_i)\cdot A(s_i,a_i)

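Here, J(\theta) is the expected return under the policy \pi_\theta, N is the number of sampled state-action pairs (s_i, a_i), and A(s_i, a_i) is the advantage of taking action a_i in state s_i.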
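The estimate above can be sketched in a few lines of plain Python. This is a minimal illustration, not the article's implementation: it assumes a softmax policy over state-independent logits (so the state s_i is dropped), and the helper names `softmax`, `grad_log_pi` and `policy_gradient_estimate`, as well as the sample batch, are hypothetical.

```python
import math

def softmax(logits):
    """Convert a list of logits into action probabilities."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def grad_log_pi(logits, action):
    """Gradient of log pi(action) w.r.t. the logits: one_hot(action) - pi."""
    probs = softmax(logits)
    return [(1.0 if k == action else 0.0) - p for k, p in enumerate(probs)]

def policy_gradient_estimate(logits, actions, advantages):
    """Average of grad log pi(a_i) * A_i over the N samples."""
    n = len(actions)
    grad = [0.0] * len(logits)
    for a, adv in zip(actions, advantages):
        g = grad_log_pi(logits, a)
        for k in range(len(grad)):
            grad[k] += g[k] * adv / n
    return grad

# Hypothetical batch: two actions, three sampled (action, advantage) pairs.
theta = [0.0, 0.0]
estimate = policy_gradient_estimate(theta, actions=[1, 0, 1], advantages=[1.0, -0.5, 2.0])
print(estimate)
```

In this batch, action 1 was sampled with positive advantages and action 0 with a negative one, so the estimate points toward raising the logit of action 1.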

2. Value Function Update (Critic)

\nabla_w J(w) \approx \frac{1}{N}\sum_{i=1}^{N} \nabla_w (V_{w}(s_i)- Q_{w}(s_i , a_i))^2

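Here, V_w(s_i) is the critic's estimate of the value of state s_i, Q_w(s_i, a_i) is the estimated action value that serves as the regression target, and w denotes the critic's parameters.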
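As a rough sketch of the critic's gradient, the snippet below uses a linear value function V_w(s) = w . s and treats the action-value term as a fixed numeric target. Both choices are assumptions made for illustration, and the names `value`, `critic_gradient` and `mse` and the toy data are hypothetical.

```python
def value(w, s):
    """Linear state-value estimate V_w(s) = w . s."""
    return sum(wk * sk for wk, sk in zip(w, s))

def critic_gradient(w, states, targets):
    """Gradient of the mean of (V_w(s_i) - target_i)^2 with respect to w."""
    n = len(states)
    grad = [0.0] * len(w)
    for s, target in zip(states, targets):
        error = value(w, s) - target
        for k in range(len(w)):
            grad[k] += 2.0 * error * s[k] / n
    return grad

def mse(w, states, targets):
    """Mean squared error between V_w(s_i) and the targets."""
    return sum((value(w, s) - t) ** 2 for s, t in zip(states, targets)) / len(states)

# Hypothetical data: two one-hot states with target values 1.0 and 2.0.
w = [0.0, 0.0]
states = [[1.0, 0.0], [0.0, 1.0]]
targets = [1.0, 2.0]

grad = critic_gradient(w, states, targets)
w_new = [wk - 0.1 * gk for wk, gk in zip(w, grad)]  # one gradient descent step
print(grad, mse(w, states, targets), mse(w_new, states, targets))
```

A single descent step along the negative gradient lowers the squared error, which is the behavior the critic relies on.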

Update Rules

The update rules for the actor and critic involve adjusting their respective parameters using gradient ascent (for the actor) and gradient descent (for the critic).

Actor Update

\theta_{t+1}= \theta_t + \alpha \nabla_\theta J(\theta_t)

Here, \alpha is the learning rate of the actor and t indexes the update step.

Critic Update

w_{t+1} = w_t -\beta \nabla_w J(w_t)

Here, \beta is the learning rate of the critic and w denotes the critic's parameters.
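The two update rules can be seen working together in a deliberately tiny setting. The sketch below assumes a single-state problem (a two-armed bandit) with deterministic rewards, so the advantage reduces to r - V and there is no next state to bootstrap from. The function `train_bandit`, the reward values and the step sizes are hypothetical choices for illustration.

```python
import math
import random

def softmax(logits):
    """Convert a list of logits into action probabilities."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def train_bandit(rewards, steps=500, alpha=0.1, beta=0.1, seed=0):
    """One-state actor-critic: theta holds the actor's logits, v is the critic's value."""
    rng = random.Random(seed)
    theta = [0.0] * len(rewards)
    v = 0.0
    for _ in range(steps):
        probs = softmax(theta)
        action = rng.choices(range(len(rewards)), weights=probs)[0]
        reward = rewards[action]
        advantage = reward - v  # single state, so there is no next-state value

        # Actor update (gradient ascent): theta <- theta + alpha * grad log pi(a) * A
        for k in range(len(theta)):
            grad_log_pi = (1.0 if k == action else 0.0) - probs[k]
            theta[k] += alpha * grad_log_pi * advantage

        # Critic update (gradient descent on (v - reward)^2): v <- v - beta * 2 * (v - reward)
        v -= beta * 2.0 * (v - reward)
    return softmax(theta), v

probs, v = train_bandit([0.0, 1.0])
print(probs, v)
```

Because arm 1 always pays more than the critic's running estimate and arm 0 never does, every actor step shifts probability toward arm 1.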

Advantage Function

The advantage function A(s,a) measures the advantage of taking action a in state s over the expected value of the state under the current policy.

A(s,a)=Q(s,a)−V(s)

A positive advantage means the action is better than the policy's average action in that state, and a negative advantage means it is worse. The actor is updated along the policy gradient, which raises the probability of actions with higher advantages, while the critic is updated to minimize the squared difference between the estimated state value and the action value.
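A small worked example may make the definition concrete. It uses the standard identity V(s) = sum over a of pi(a|s) Q(s,a); the policy probabilities and action values below are made-up numbers, and the helper names are hypothetical.

```python
def state_value(policy, q_values):
    """V(s) as the policy-weighted average of the action values Q(s, a)."""
    return sum(p * q for p, q in zip(policy, q_values))

def advantages(policy, q_values):
    """A(s, a) = Q(s, a) - V(s) for every action a."""
    v = state_value(policy, q_values)
    return [q - v for q in q_values]

# Hypothetical state with three actions.
policy = [0.25, 0.25, 0.5]
q_values = [1.0, 3.0, 2.0]

v = state_value(policy, q_values)
adv = advantages(policy, q_values)
expected_adv = sum(p * a for p, a in zip(policy, adv))
print(v, adv, expected_adv)
```

Here the second action is better than average, the first is worse, and the third is exactly average. The policy-weighted mean of the advantages is zero, which is why the advantage only reports relative quality.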

Training Agent: Actor-Critic Algorithm

Let's understand how the Actor-Critic algorithm works in practice. Below is an implementation of a simple Actor-Critic algorithm using TensorFlow and OpenAI Gym to train an agent in the CartPole environment.

Step 1: Import Libraries

Python
import numpy as np
import tensorflow as tf
import gym
Step 2: Creating CartPole Environment

Create the CartPole environment with the gym.make() function from the Gym library, which provides a standardized interface for interacting with reinforcement learning tasks.

Python
# Create the CartPole Environment
env = gym.make('CartPole-v1')
Step 3: Defining Actor and Critic Networks

Python
# Define the actor and critic networks
actor = tf.keras.Sequential([
    tf.keras.layers.Dense(32, activation='relu'),
    tf.keras.layers.Dense(env.action_space.n, activation='softmax')
])

critic = tf.keras.Sequential([
    tf.keras.layers.Dense(32, activation='relu'),
    tf.keras.layers.Dense(1)
])
Step 4: Defining Optimizers and Loss Functions

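We use the Adam optimizer for both networks, each with a learning rate of 0.001. The losses themselves are computed inside the training loop.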

Python
# Define the optimizers for the actor and the critic
actor_optimizer = tf.keras.optimizers.Adam(learning_rate=0.001)
critic_optimizer = tf.keras.optimizers.Adam(learning_rate=0.001)
Step 5: Training Loop

The training loop runs for 1000 episodes with the agent interacting with the environment, calculating advantages and updating both the actor and critic.
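The advantage used in this loop is a one-step temporal-difference (TD) error, r + gamma * V(s') - V(s), and it can be sketched in isolation. The function name `td_advantage` and the numbers are hypothetical, and the `terminal` flag is an addition for illustration: it drops the bootstrapped term once the episode has ended.

```python
def td_advantage(reward, value_s, value_next, gamma=0.99, terminal=False):
    """One-step TD error used as an advantage estimate."""
    # After a terminal transition there is no future return to bootstrap from.
    bootstrap = 0.0 if terminal else gamma * value_next
    return reward + bootstrap - value_s

# Hypothetical values: reward 1, V(s) = 10, V(s') = 10, gamma = 0.5.
adv_mid = td_advantage(1.0, 10.0, 10.0, gamma=0.5)
adv_end = td_advantage(1.0, 10.0, 10.0, gamma=0.5, terminal=True)
print(adv_mid, adv_end)
```

With the same reward and value estimates, the terminal transition yields a more negative advantage because no future value is credited.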

Python
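# Main training loop
# Assumes Gym >= 0.26 (or Gymnasium): reset() returns (observation, info) and
# step() returns (observation, reward, terminated, truncated, info).
num_episodes = 1000
gamma = 0.99

for episode in range(num_episodes):
    state, _ = env.reset()
    episode_reward = 0

    for t in range(1, 10000):  # Limit the number of time steps
        state_input = np.array([state], dtype=np.float32)

        # Open a fresh tape for each step so it records only this step's forward passes
        with tf.GradientTape(persistent=True) as tape:
            # Choose an action using the actor
            action_probs = actor(state_input)
            # Renormalize in float64; float32 rounding can make np.random.choice reject p
            probs = action_probs.numpy()[0].astype(np.float64)
            action = np.random.choice(env.action_space.n, p=probs / probs.sum())

            # Take the chosen action and observe the next state and reward
            next_state, reward, terminated, truncated, _ = env.step(action)
            done = terminated or truncated

            # Compute the advantage as the one-step TD error
            state_value = critic(state_input)[0, 0]
            next_state_value = critic(np.array([next_state], dtype=np.float32))[0, 0]
            # Do not bootstrap from a terminal state, and treat the target as a constant
            td_target = tf.stop_gradient(
                reward + gamma * next_state_value * (1.0 - float(terminated)))
            advantage = td_target - state_value

            # Compute actor and critic losses; the actor treats the advantage as a constant
            actor_loss = -tf.math.log(action_probs[0, action]) * tf.stop_gradient(advantage)
            critic_loss = tf.square(advantage)

        # Update actor and critic
        actor_gradients = tape.gradient(actor_loss, actor.trainable_variables)
        critic_gradients = tape.gradient(critic_loss, critic.trainable_variables)
        actor_optimizer.apply_gradients(zip(actor_gradients, actor.trainable_variables))
        critic_optimizer.apply_gradients(zip(critic_gradients, critic.trainable_variables))
        del tape  # Release the persistent tape's resources

        episode_reward += reward
        state = next_state  # Advance to the next state before the next step

        if done:
            break

    if episode % 10 == 0:
        print(f'Episode {episode}, Reward: {episode_reward}')

env.close()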

Advantages of Actor Critic Algorithm

The Actor-Critic method offers several advantages:

Variants of Actor-Critic Algorithms

Several variants of the Actor-Critic algorithm have been developed to address specific challenges or improve performance in certain types of environments:

A(s_t, a_t) = Q(s_t, a_t) - V(s_t)

A2C (Advantage Actor-Critic) weights the policy gradient by this advantage rather than by the raw return, which reduces the variance of the gradient estimate and makes learning more stable.
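The variance reduction can be checked exactly on a toy case. The sketch below assumes a single state, a two-action softmax policy and deterministic action values, all made-up for illustration. It enumerates both actions and compares the policy-gradient sample for the logit of action 1 when weighted by Q(s,a) against the same sample weighted by A(s,a) = Q(s,a) - V(s). The helper names are hypothetical.

```python
def gradient_samples(policy, q_values, baseline):
    """Per-action sample of d log pi(a) / d theta_1 times (Q(s, a) - baseline)."""
    samples = []
    for action, q in enumerate(q_values):
        grad_log_pi = (1.0 if action == 1 else 0.0) - policy[1]
        samples.append(grad_log_pi * (q - baseline))
    return samples

def mean_and_variance(policy, samples):
    """Exact mean and variance of the samples when actions are drawn from the policy."""
    mean = sum(p * g for p, g in zip(policy, samples))
    variance = sum(p * (g - mean) ** 2 for p, g in zip(policy, samples))
    return mean, variance

# Hypothetical two-action problem with similar, large action values.
policy = [0.5, 0.5]
q_values = [10.0, 11.0]
v = sum(p * q for p, q in zip(policy, q_values))  # state value used as the baseline

mean_q, var_q = mean_and_variance(policy, gradient_samples(policy, q_values, 0.0))
mean_a, var_a = mean_and_variance(policy, gradient_samples(policy, q_values, v))
print(mean_q, var_q)
print(mean_a, var_a)
```

Both estimators have the same expected gradient, but subtracting V(s) removes the large shared offset in the action values, so the advantage-weighted samples vary far less.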


