The process of learning to maximize expected rewards by trial and error. Feedback is given in the form of rewards. Agent’s utility is defined by the reward function. All learning is based on observed samples of outcomes.
Builds on MDPs, but here the transition function and reward function are unknown, so the agent cannot simulate the world from a known model. Learning happens online.
Defined by:
- States
- Actions
- Policy
Model-Based vs. Model-Free
Model-Based Learning
Learns a model of the environment: observations are gathered over multiple simulations, and the transition and reward functions are learnt from them. Used when the environment can be modelled.
Model-Free Learning
Learns directly from interactions with the environment. Used when the environment is too complex or cannot be modelled. Can use either sample-based policy evaluation or control methods.
Sample-Based Policy Evaluation
A method for estimating the value of a given policy by using averages obtained from samples. Used in some model-free learning algorithms.
Over multiple episodes, following a fixed policy $\pi$, the following update is executed for each visited state $s$:

$$V^\pi(s) = \frac{1}{N(s)} \sum_{i=1}^{N(s)} G_i(s)$$

Here:
- $G_i(s)$ is the discounted sum of rewards from the $i$-th visit to $s$
- $N(s)$ is the number of samples for $s$
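Since storing every sample is unnecessary, the average can be maintained incrementally. A minimal sketch of this running-average form (the sampled returns here are made up for illustration):

```python
# Incremental sample-based policy evaluation for a single state (sketch).
def update_value(V, N, G):
    """Fold one sampled return G into the running average V over N samples."""
    N += 1
    V += (G - V) / N  # equivalent to V = ((N - 1) * V + G) / N
    return V, N

V, N = 0.0, 0
for G in [10.0, 6.0, 8.0]:  # hypothetical sampled returns from visits to s
    V, N = update_value(V, N, G)

print(V)  # average of the three samples: 8.0
```

The incremental form gives the same result as storing all returns and averaging at the end, but in constant memory per state.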
Active vs. Passive
Active
Learns the best policy by interacting with the environment.
Exploration vs. Exploitation
A challenge for active RL. The agent must balance:
- Explore: try new actions to discover rewards.
- Exploit: use best-known action for high return.
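One common way to balance the two (not prescribed by these notes, shown as an illustrative sketch) is an epsilon-greedy policy: with probability $\epsilon$ take a random action, otherwise take the best-known one. The Q-values below are hypothetical.

```python
import random

def epsilon_greedy(Q, state, actions, epsilon=0.1):
    """Pick a random action with probability epsilon, else the greedy one."""
    if random.random() < epsilon:
        return random.choice(actions)          # explore
    return max(actions, key=lambda a: Q[(state, a)])  # exploit

# Hypothetical Q-values for one state
Q = {('s0', 'left'): 1.0, ('s0', 'right'): 2.5}
action = epsilon_greedy(Q, 's0', ['left', 'right'], epsilon=0.0)
print(action)  # with epsilon=0 this is always the greedy action: 'right'
```

Decaying $\epsilon$ over time shifts the agent from exploration early on to exploitation once estimates are reliable.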
Passive
Follows the given policy $\pi$. Learns the value of states $V^\pi(s)$ using sample-based policy evaluation.
Algorithms
All of the below algorithms use sample-based policy evaluation.
Direct Evaluation
Model-free. Passive. Learns the value of each state by averaging the total discounted reward observed after visiting it under a fixed policy $\pi$. Simple to implement.
Must wait until the episode ends to compute returns. Each state must be learnt separately, so information about connections between states is wasted. Slow to converge.
```python
import random

# Define a simple chain environment
states = ['A', 'B', 'C', 'D', 'E']
rewards = {'A': 2, 'B': 1, 'C': 5, 'D': 0, 'E': 10}
transitions = {
    'A': 'B',
    'B': 'C',
    'C': 'D',
    'D': 'E',
    'E': None,  # terminal
}

# Parameters
gamma = 1.0  # no discount
num_episodes = 1000

# Value table and per-state returns
V = {s: 0.0 for s in states}
returns = {s: [] for s in states}

for _ in range(num_episodes):
    # Start at a random state
    state = random.choice(states)
    episode = []
    # Generate an episode following the fixed policy (always move to the next state)
    while state is not None:
        reward = rewards[state]
        episode.append((state, reward))
        state = transitions[state]
    # Calculate returns from each state in the episode
    G = 0
    for t in reversed(range(len(episode))):
        s, r = episode[t]
        G = r + gamma * G
        returns[s].append(G)

# Compute the average return for each state
for s in states:
    if returns[s]:
        V[s] = sum(returns[s]) / len(returns[s])

print("Estimated State Values (V):")
for s, v in V.items():
    print(f"{s}: {v:.2f}")
```
Temporal Difference Learning
Aka. TD Learning. Model-free. Passive. Smarter version of Direct Evaluation. Updates value estimates after each step, not waiting for episode end. Combines ideas from Monte Carlo (learning from experience) and Dynamic Programming (bootstrapping using next state’s value). Incremental.
Faster than Direct Evaluation.
The sample value of state $s$, after taking action $\pi(s)$, receiving reward $R(s, \pi(s), s')$, and landing in $s'$, is calculated as:

$$\text{sample} = R(s, \pi(s), s') + \gamma V^\pi(s')$$

The update rule is:

$$V^\pi(s) \leftarrow (1 - \alpha)\,V^\pi(s) + \alpha \cdot \text{sample}$$

Or, equivalently:

$$V^\pi(s) \leftarrow V^\pi(s) + \alpha\,(\text{sample} - V^\pi(s))$$

Here $\alpha$ is the learning rate. A larger $\alpha$ provides faster but noisier learning; decreasing $\alpha$ over time gives a convergent average.
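The update can be sketched on the same chain environment used for direct evaluation above (the chain and the constants are illustrative, not prescribed):

```python
# TD(0) value estimation on a small deterministic chain (illustrative sketch).
rewards = {'A': 2, 'B': 1, 'C': 5, 'D': 0, 'E': 10}
transitions = {'A': 'B', 'B': 'C', 'C': 'D', 'D': 'E', 'E': None}

gamma = 1.0   # no discount
alpha = 0.1   # learning rate
V = {s: 0.0 for s in rewards}

for _ in range(2000):
    state = 'A'
    while state is not None:
        next_state = transitions[state]
        # Sample = immediate reward plus discounted value of the next state
        sample = rewards[state] + gamma * (V[next_state] if next_state else 0.0)
        # Move the estimate toward the sample by the learning rate
        V[state] += alpha * (sample - V[state])
        state = next_state

print({s: round(v, 1) for s, v in V.items()})
# → {'A': 18.0, 'B': 16.0, 'C': 15.0, 'D': 10.0, 'E': 10.0}
```

Unlike direct evaluation, each estimate is updated immediately after every step, bootstrapping on the current estimate of the next state rather than waiting for the full return.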
Q-Learning
Model-free. Active. Learns action values directly to find the optimal policy. Performs sample-based Q-value iteration.
Requires enough exploration to converge to the optimal policy. The learning rate must be reduced over time, and the reduction must be gradual, not too quick.
The sample Q-value of taking action $a$ at state $s$, after receiving reward $R(s, a, s')$ and landing in $s'$, is calculated as:

$$\text{sample} = R(s, a, s') + \gamma \max_{a'} Q(s', a')$$

The update rule is:

$$Q(s, a) \leftarrow (1 - \alpha)\,Q(s, a) + \alpha \cdot \text{sample}$$

Or, equivalently:

$$Q(s, a) \leftarrow Q(s, a) + \alpha\,(\text{sample} - Q(s, a))$$

The optimal policy can be extracted using $\pi^*(s) = \arg\max_a Q(s, a)$.
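A minimal tabular sketch on a made-up two-state MDP (the environment, constants, and state names are invented for illustration):

```python
import random

# Tabular Q-learning on a tiny deterministic MDP (illustrative sketch).
# States: 's0' -> 's1' -> terminal. 'go' advances; 'stay' remains in place.
def step(state, action):
    """Return (reward, next_state); next_state None means terminal."""
    if action == 'go':
        return (0, 's1') if state == 's0' else (10, None)
    return (-1, state)  # 'stay' costs 1 and stays put

actions = ['go', 'stay']
gamma, alpha, epsilon = 0.9, 0.5, 0.2
Q = {(s, a): 0.0 for s in ['s0', 's1'] for a in actions}

random.seed(0)
for _ in range(500):
    state = 's0'
    while state is not None:
        # Epsilon-greedy exploration
        if random.random() < epsilon:
            action = random.choice(actions)
        else:
            action = max(actions, key=lambda a: Q[(state, a)])
        reward, next_state = step(state, action)
        # The sample backs up the best Q-value of the next state
        best_next = max(Q[(next_state, a)] for a in actions) if next_state else 0.0
        sample = reward + gamma * best_next
        Q[(state, action)] += alpha * (sample - Q[(state, action)])
        state = next_state

# Extract the greedy policy from the learnt Q-values
policy = {s: max(actions, key=lambda a: Q[(s, a)]) for s in ['s0', 's1']}
print(policy)  # the optimal policy here is always 'go'
```

Note that the agent learns the optimal policy even while acting epsilon-greedily, because the update bootstraps on $\max_{a'} Q(s', a')$ rather than on the action actually taken next (off-policy learning).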