The process of learning to maximize expected rewards by trial and error. Feedback is given in the form of rewards. Agent’s utility is defined by the reward function. All learning is based on observed samples of outcomes.
Builds on MDPs, but here the transition function and reward function are unknown, so the agent cannot simulate the world from a known model. Learning happens online.
Defined by:
- States
- Actions
- Policy
Model-Based vs. Model-Free
Model-Based Learning
Learns a model of the environment: observations are gathered over multiple simulations, and the transition and reward functions are learnt from them. Used when the environment can be modelled.
Model-Free Learning
Learns directly from interactions with the environment. Used when the environment is too complex or cannot be modelled. Can use either sample-based policy evaluation or control methods.
Sample-Based Policy Evaluation
A method for estimating the value of a given policy by using averages obtained from samples. Used in some model-free learning algorithms.
Over multiple episodes, following a fixed policy $\pi$, the following update is executed for each visited state $s$:

$$V^\pi(s) = \frac{1}{N(s)} \sum_{i=1}^{N(s)} G_i(s)$$

Here:
- $G_i(s)$ is the discounted sum of rewards from the $i$-th visit to $s$
- $N(s)$ is the number of samples for $s$
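Since storing every sample is unnecessary, the average can be maintained incrementally. A minimal sketch of this running-average form (the sampled returns here are made up for illustration):

```python
# Incremental sample-based policy evaluation for a single state (sketch).
def update_value(V, N, G):
    """Fold one sampled return G into the running average V over N samples."""
    N += 1
    V += (G - V) / N  # equivalent to V = ((N - 1) * V + G) / N
    return V, N

V, N = 0.0, 0
for G in [10.0, 6.0, 8.0]:  # hypothetical sampled returns from visits to s
    V, N = update_value(V, N, G)

print(V)  # average of the three samples: 8.0
```

The incremental form gives the same result as storing all returns and averaging at the end, but in constant memory per state.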
Active vs. Passive
Active
Learns the best policy by interacting with the environment.
Exploration vs. Exploitation
A challenge for active RL. The agent must balance:
- Explore: try new actions to discover rewards.
- Exploit: use best-known action for high return.
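One common way to balance the two (not prescribed by these notes, shown as an illustrative sketch) is an epsilon-greedy policy: with probability $\epsilon$ take a random action, otherwise take the best-known one. The Q-values below are hypothetical.

```python
import random

def epsilon_greedy(Q, state, actions, epsilon=0.1):
    """Pick a random action with probability epsilon, else the greedy one."""
    if random.random() < epsilon:
        return random.choice(actions)          # explore
    return max(actions, key=lambda a: Q[(state, a)])  # exploit

# Hypothetical Q-values for one state
Q = {('s0', 'left'): 1.0, ('s0', 'right'): 2.5}
action = epsilon_greedy(Q, 's0', ['left', 'right'], epsilon=0.0)
print(action)  # with epsilon=0 this is always the greedy action: 'right'
```

Decaying $\epsilon$ over time shifts the agent from exploration early on to exploitation once estimates are reliable.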
Passive
Follows the given policy $\pi$. Learns the value of states $V^\pi(s)$ using sample-based policy evaluation.
Algorithms
All of the below algorithms use sample-based policy evaluation.
Direct Evaluation
Model-free. Passive. Learns the value of each state by averaging the total discounted reward observed after visiting it under a fixed policy $\pi$. Simple to implement.
Must wait until the episode ends to compute returns. Each state must be learnt separately, so information about connections between states is wasted. Slow to converge.
```python
import random

# Define a simple chain environment
states = ['A', 'B', 'C', 'D', 'E']
rewards = {'A': 2, 'B': 1, 'C': 5, 'D': 0, 'E': 10}
transitions = {
    'A': 'B',
    'B': 'C',
    'C': 'D',
    'D': 'E',
    'E': None,  # terminal
}

# Parameters
gamma = 1.0  # no discount
num_episodes = 1000

# Value table and per-state returns
V = {s: 0.0 for s in states}
returns = {s: [] for s in states}

for _ in range(num_episodes):
    # Start at a random state
    state = random.choice(states)
    episode = []
    # Generate an episode following the fixed policy (always move to the next state)
    while state is not None:
        reward = rewards[state]
        episode.append((state, reward))
        state = transitions[state]
    # Calculate returns from each state in the episode
    G = 0
    for t in reversed(range(len(episode))):
        s, r = episode[t]
        G = r + gamma * G
        returns[s].append(G)

# Compute the average return for each state
for s in states:
    if returns[s]:
        V[s] = sum(returns[s]) / len(returns[s])

print("Estimated State Values (V):")
for s, v in V.items():
    print(f"{s}: {v:.2f}")
```
Temporal Difference Learning
Aka. TD Learning. Model-free. Passive. Smarter version of Direct Evaluation. Updates value estimates after each step, not waiting for episode end. Combines ideas from Monte Carlo (learning from experience) and Dynamic Programming (bootstrapping using next state’s value). Incremental.
Faster than Direct Evaluation.
The sample value of state $s$, after taking action $\pi(s)$, receiving reward $R(s, \pi(s), s')$, and landing in $s'$, is calculated as:

$$\text{sample} = R(s, \pi(s), s') + \gamma V^\pi(s')$$

The update rule is:

$$V^\pi(s) \leftarrow (1 - \alpha)\,V^\pi(s) + \alpha \cdot \text{sample}$$

Or, equivalently:

$$V^\pi(s) \leftarrow V^\pi(s) + \alpha\,(\text{sample} - V^\pi(s))$$

Here $\alpha$ is the learning rate. A larger $\alpha$ provides faster but noisier learning; decreasing $\alpha$ over time gives a convergent average.
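The update can be sketched on the same chain environment used for direct evaluation above (the chain and the constants are illustrative, not prescribed):

```python
# TD(0) value estimation on a small deterministic chain (illustrative sketch).
rewards = {'A': 2, 'B': 1, 'C': 5, 'D': 0, 'E': 10}
transitions = {'A': 'B', 'B': 'C', 'C': 'D', 'D': 'E', 'E': None}

gamma = 1.0   # no discount
alpha = 0.1   # learning rate
V = {s: 0.0 for s in rewards}

for _ in range(2000):
    state = 'A'
    while state is not None:
        next_state = transitions[state]
        # Sample = immediate reward plus discounted value of the next state
        sample = rewards[state] + gamma * (V[next_state] if next_state else 0.0)
        # Move the estimate toward the sample by the learning rate
        V[state] += alpha * (sample - V[state])
        state = next_state

print({s: round(v, 1) for s, v in V.items()})
# → {'A': 18.0, 'B': 16.0, 'C': 15.0, 'D': 10.0, 'E': 10.0}
```

Unlike direct evaluation, each estimate is updated immediately after every step, bootstrapping on the current estimate of the next state rather than waiting for the full return.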
Q-Learning
Model-free. Active. Learns action values directly to find the optimal policy. Performs sample-based Q-value iteration.
Requires enough exploration to converge to the optimal policy. The learning rate must be reduced over time, and the reduction must be gradual, not too quick.
The sample Q-value of taking action $a$ at state $s$, after receiving reward $R(s, a, s')$ and landing in $s'$, is calculated as:

$$\text{sample} = R(s, a, s') + \gamma \max_{a'} Q(s', a')$$

The update rule is:

$$Q(s, a) \leftarrow (1 - \alpha)\,Q(s, a) + \alpha \cdot \text{sample}$$

Or, equivalently:

$$Q(s, a) \leftarrow Q(s, a) + \alpha\,(\text{sample} - Q(s, a))$$

The optimal policy can be extracted using $\pi^*(s) = \arg\max_a Q(s, a)$.
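A minimal tabular sketch on a made-up two-state MDP (the environment, constants, and state names are invented for illustration):

```python
import random

# Tabular Q-learning on a tiny deterministic MDP (illustrative sketch).
# States: 's0' -> 's1' -> terminal. 'go' advances; 'stay' remains in place.
def step(state, action):
    """Return (reward, next_state); next_state None means terminal."""
    if action == 'go':
        return (0, 's1') if state == 's0' else (10, None)
    return (-1, state)  # 'stay' costs 1 and stays put

actions = ['go', 'stay']
gamma, alpha, epsilon = 0.9, 0.5, 0.2
Q = {(s, a): 0.0 for s in ['s0', 's1'] for a in actions}

random.seed(0)
for _ in range(500):
    state = 's0'
    while state is not None:
        # Epsilon-greedy exploration
        if random.random() < epsilon:
            action = random.choice(actions)
        else:
            action = max(actions, key=lambda a: Q[(state, a)])
        reward, next_state = step(state, action)
        # The sample backs up the best Q-value of the next state
        best_next = max(Q[(next_state, a)] for a in actions) if next_state else 0.0
        sample = reward + gamma * best_next
        Q[(state, action)] += alpha * (sample - Q[(state, action)])
        state = next_state

# Extract the greedy policy from the learnt Q-values
policy = {s: max(actions, key=lambda a: Q[(s, a)]) for s in ['s0', 's1']}
print(policy)  # the optimal policy here is always 'go'
```

Note that the agent learns the optimal policy even while acting epsilon-greedily, because the update bootstraps on $\max_{a'} Q(s', a')$ rather than on the action actually taken next (off-policy learning).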