Course outline
We present the outline of a course on reinforcement learning.
The reference book and key inspiration for this course, and for reinforcement learning in general, is the book by Sutton and Barto (it can be downloaded for free from the authors' website). These lecture notes exist because we take a more theoretical focus and provide toy code to play with the different algorithms.
Part 1: Multi-armed Bandits
We introduce the multi-armed bandit model and discuss the exploration / exploitation trade-off. As a technical result, we give a full proof of the regret bound for the UCB algorithm for multi-armed bandits.
There is an accompanying GitHub repository for experimenting with different algorithms for multi-armed bandits.
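As a taste of the algorithms covered in this part, here is a minimal sketch of UCB on Bernoulli arms; the arm means, horizon, and exploration constant are illustrative choices, not taken from the course material.

```python
import numpy as np

def ucb(true_means, horizon, c=2.0, seed=0):
    """UCB on a Bernoulli bandit: pull each arm once, then maximize
    the empirical mean plus an exploration bonus."""
    rng = np.random.default_rng(seed)
    k = len(true_means)
    counts = np.zeros(k)       # pulls per arm
    means = np.zeros(k)        # empirical mean reward per arm
    total = 0.0
    for t in range(1, horizon + 1):
        if t <= k:
            arm = t - 1        # initialization: pull each arm once
        else:
            bonus = np.sqrt(c * np.log(t) / counts)
            arm = int(np.argmax(means + bonus))
        reward = float(rng.random() < true_means[arm])   # Bernoulli draw
        counts[arm] += 1
        means[arm] += (reward - means[arm]) / counts[arm]
        total += reward
    return total

print(ucb([0.2, 0.5, 0.7], horizon=10_000))  # approaches 0.7 * 10_000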
Part 2: Markov decision processes and dynamic algorithms
We discuss the model of Markov decision processes (MDPs) and construct algorithms for the often unrealistic scenario where the whole MDP is both finite and fully known; these are called dynamic algorithms.
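One classical example of such a dynamic algorithm is value iteration. Here is a minimal sketch on a made-up two-state, two-action MDP; all numbers are illustrative assumptions.

```python
import numpy as np

# Toy MDP: P[s, a, s'] transition probabilities, R[s, a] expected rewards.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.1, 0.9]]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
gamma = 0.9

V = np.zeros(2)
while True:
    Q = R + gamma * (P @ V)   # Q[s, a] = R[s, a] + gamma * sum_s' P[s, a, s'] V[s']
    V_new = Q.max(axis=1)     # Bellman optimality update
    if np.max(np.abs(V_new - V)) < 1e-10:
        break
    V = V_new
print(V_new, Q.argmax(axis=1))  # optimal values and a greedy optimal policy
```

Since gamma < 1, the Bellman update is a contraction, so the loop is guaranteed to converge.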
Part 3: Monte Carlo approaches, temporal differences and off-policy learning
We explain how to lift the assumption that the MDP is fully known by turning to statistical approaches: Monte Carlo methods and temporal difference learning. We also discuss the distinction between on- and off-policy learning.
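As one concrete instance of off-policy temporal difference learning, here is a minimal sketch of tabular Q-learning on a made-up chain environment; the environment, rewards, and hyperparameters are illustrative assumptions, not taken from the course.

```python
import numpy as np

# Chain with states 0..4, actions 0 = left, 1 = right, reward 1 at state 4.
n_states, n_actions = 5, 2
rng = np.random.default_rng(0)

def step(s, a):
    s2 = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
    return s2, float(s2 == n_states - 1), s2 == n_states - 1

Q = np.zeros((n_states, n_actions))
alpha, gamma, eps = 0.1, 0.95, 0.1
for _ in range(500):                       # episodes
    s, done = 0, False
    while not done:
        # epsilon-greedy behaviour policy
        a = rng.integers(n_actions) if rng.random() < eps else int(Q[s].argmax())
        s2, r, done = step(s, a)
        # off-policy TD update: bootstrap on the greedy action at s2
        Q[s, a] += alpha * (r + gamma * Q[s2].max() * (not done) - Q[s, a])
        s = s2
print(Q.argmax(axis=1))   # learned greedy policy: interior states go right
```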
Part 4: Two-player games and self-play
We consider how to extend the algorithms for MDPs to two-player games, introducing the key idea of self-play. One of the most successful algorithms in this setting is Monte Carlo tree search (MCTS).
There is an accompanying GitHub repository with implementations of some of these algorithms for solving Tic-Tac-Toe.
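To give a feel for Monte Carlo evaluation in games, here is a sketch of flat Monte Carlo move selection for Tic-Tac-Toe: each legal move is scored by random playouts. This is a deliberately simplified cousin of MCTS, which additionally grows a search tree and directs playouts with a UCB rule; the position and playout budget below are illustrative.

```python
import random

LINES = [(0,1,2),(3,4,5),(6,7,8),(0,3,6),(1,4,7),(2,5,8),(0,4,8),(2,4,6)]

def winner(board):
    for a, b, c in LINES:
        if board[a] != ' ' and board[a] == board[b] == board[c]:
            return board[a]
    return None

def playout(board, player):
    """Finish the game with uniformly random moves; return the winner or None."""
    board = board[:]
    while True:
        w = winner(board)
        if w or ' ' not in board:
            return w
        move = random.choice([i for i, c in enumerate(board) if c == ' '])
        board[move] = player
        player = 'O' if player == 'X' else 'X'

def best_move(board, player, n_playouts=200):
    """Score each legal move by random playouts, keep the best."""
    opponent = 'O' if player == 'X' else 'X'
    def score(m):
        child = board[:]
        child[m] = player
        results = [playout(child, opponent) for _ in range(n_playouts)]
        return sum((r == player) - (r == opponent) for r in results)
    return max((i for i, c in enumerate(board) if c == ' '), key=score)

board = list('X O  O   ')   # example position, X to move
print(best_move(board, 'X'))  # should block O's column at square 8
```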
Part 5: Deep reinforcement learning
We lift the remaining assumption that the state space is small enough to represent values exactly in a table, and introduce function approximation using various models, such as linear models or neural networks.
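As a first step toward function approximation, here is a minimal sketch of semi-gradient TD(0) with a linear value function on a made-up random-walk chain; the features, environment, and step size are illustrative assumptions.

```python
import numpy as np

# Random walk on states 0..9, terminal at both ends, reward 1 at the right end.
# The value function is approximated linearly: v(s) ~= w . phi(s).
n_states = 10
rng = np.random.default_rng(0)

def phi(s):
    """Simple two-dimensional features: normalized position and a bias term."""
    return np.array([s / (n_states - 1), 1.0])

w = np.zeros(2)
alpha, gamma = 0.05, 1.0
for _ in range(2000):                     # episodes
    s = n_states // 2
    while True:
        s2 = s + (1 if rng.random() < 0.5 else -1)
        done = s2 in (0, n_states - 1)
        r = 1.0 if s2 == n_states - 1 else 0.0
        target = r + (0.0 if done else gamma * w @ phi(s2))
        w += alpha * (target - w @ phi(s)) * phi(s)   # semi-gradient step
        if done:
            break
        s = s2
print([round(w @ phi(s), 2) for s in range(1, n_states - 1)])  # approx. s/9
```

With these features the true values v(s) = s/9 are exactly representable, so the linear approximation can converge to them; richer models like neural networks play the same role when no such hand-crafted features exist.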