Reinforcement Learning

A machine learning paradigm in which an agent interacts with an environment and learns a policy through trial and error, with the goal of maximizing cumulative reward.

Key algorithms include temporal-difference methods such as Q-learning, along with policy gradient methods, as detailed in Sutton and Barto's textbook Reinforcement Learning: An Introduction.
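
As an illustration of the Q-learning update mentioned above, here is a minimal tabular sketch. The chain environment, the function names (q_learning_update, epsilon_greedy, train_chain), and the hyperparameter values are all assumptions chosen for this example; they are not taken from the textbook or from any particular library.

```python
import random


def q_learning_update(q, state, action, reward, next_state, done,
                      alpha=0.1, gamma=0.9):
    """Apply one tabular Q-learning update in place and return the TD error."""
    # Bootstrap from the best next action unless the episode has ended.
    best_next = 0.0 if done else max(q[next_state])
    td_error = reward + gamma * best_next - q[state][action]
    q[state][action] += alpha * td_error
    return td_error


def epsilon_greedy(q, state, epsilon, rng):
    """Pick a random action with probability epsilon, else a greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q[state]))
    best = max(q[state])
    # Break ties at random so an all-zero table does not favor one action.
    candidates = [a for a, v in enumerate(q[state]) if v == best]
    return rng.choice(candidates)


def train_chain(n_states=5, episodes=200, epsilon=0.1, seed=0):
    """Train on a hypothetical chain world (an assumption for this sketch).

    Action 1 moves right, action 0 moves left. Reaching the last state
    yields reward 1 and ends the episode; every other step yields 0.
    """
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(n_states)]
    for _ in range(episodes):
        state = 0
        for _ in range(100):  # step cap so every episode terminates
            action = epsilon_greedy(q, state, epsilon, rng)
            if action == 1:
                next_state = min(state + 1, n_states - 1)
            else:
                next_state = max(state - 1, 0)
            done = next_state == n_states - 1
            reward = 1.0 if done else 0.0
            q_learning_update(q, state, action, reward, next_state, done)
            state = next_state
            if done:
                break
    return q
```

Calling train_chain() returns a table of action values, one row per state. Q-learning is itself a temporal-difference method: the td_error term is the difference between the bootstrapped target and the current estimate.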

Inside Reinforcement Learning (9)

Average reward per episode — Cumulative reward divided by number of episodes, smoothing fluctuations to reveal learning trends.
Cumulative reward — Total reward collected within an episode or across episodes, with future rewards often discounted by a factor γ.
Episode length — Number of steps taken to complete an episode; decreasing length may indicate more efficient goal-reaching.
Exploration vs. exploitation ratio — Balance between trying new actions and using known high-reward ones, e.g., ε-greedy with ε=0.1.
Policy stability — How often the agent's chosen actions still change once training has nominally converged; few changes indicate a confident, consistent policy.
Reinforcement learning in production — Netflix uses reinforcement learning to optimize its recommendation algorithms over time, aiming to maximize user satisfaction and engagement.
Sample efficiency — How well the agent learns from few experiences; critical when data collection is expensive.
Success rate — Ratio of successful episodes to total episodes, a direct measure of task completion.
Time to convergence — Number of episodes or steps until the policy stabilizes; lower is better when training is costly.
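
Several of the metrics listed above can be computed from logged episodes. The following is a rough sketch, assuming each episode is stored as a list of per-step rewards plus a success flag. The Episode record, the function names, and the stability criterion used for convergence are hypothetical choices for illustration; real projects define these differently.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Episode:
    rewards: List[float]  # reward received at each step
    success: bool         # whether the task goal was reached


def discounted_return(rewards, gamma=0.99):
    """Sum of rewards with step t weighted by gamma**t."""
    total = 0.0
    for r in reversed(rewards):
        total = r + gamma * total
    return total


def summarize(episodes, gamma=0.99):
    """Compute per-episode averages over a non-empty list of episodes."""
    n = len(episodes)
    return {
        "average_reward_per_episode":
            sum(sum(ep.rewards) for ep in episodes) / n,
        "average_discounted_return":
            sum(discounted_return(ep.rewards, gamma) for ep in episodes) / n,
        "average_episode_length":
            sum(len(ep.rewards) for ep in episodes) / n,
        "success_rate":
            sum(1 for ep in episodes if ep.success) / n,
    }


def moving_average(values, window):
    """Smooth a series with a sliding window of the given size."""
    return [sum(values[i:i + window]) / window
            for i in range(len(values) - window + 1)]


def first_stable_index(values, tolerance, patience) -> Optional[int]:
    """Index where the earliest run of `patience` consecutive small
    step-to-step changes begins, or None if no such run exists.

    This is one assumed stand-in for "time to convergence".
    """
    run = 0
    for i in range(1, len(values)):
        if abs(values[i] - values[i - 1]) < tolerance:
            run += 1
            if run == patience:
                return i - patience + 1
        else:
            run = 0
    return None
```

One way to use it: pass the list of per-episode returns through moving_average, then apply first_stable_index to the smoothed series to estimate when learning levelled off.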
