### EITC/AI/ARL Advanced Reinforced Learning is the European IT Certification programme on DeepMind’s approach to reinforcement learning in artificial intelligence.

The curriculum of the EITC/AI/ARL Advanced Reinforced Learning focuses on the theoretical aspects and practical skills of reinforcement learning techniques from the perspective of DeepMind, organized within the following structure and encompassing comprehensive video didactic content as a reference for this EITC Certification.

Reinforcement learning (RL) is an area of machine learning concerned with how intelligent agents ought to take actions in an environment in order to maximize the notion of cumulative reward. Reinforcement learning is one of three basic machine learning paradigms, alongside supervised learning and unsupervised learning.

Reinforcement learning differs from supervised learning in not needing labelled input/output pairs to be presented, and in not needing sub-optimal actions to be explicitly corrected. Instead, the focus is on finding a balance between exploration (of uncharted territory) and exploitation (of current knowledge).

The environment is typically stated in the form of a Markov decision process (MDP), because many reinforcement learning algorithms for this context use dynamic programming techniques. The main difference between the classical dynamic programming methods and reinforcement learning algorithms is that the latter do not assume knowledge of an exact mathematical model of the MDP and they target large MDPs where exact methods become infeasible.

Due to its generality, reinforcement learning is studied in many disciplines, such as game theory, control theory, operations research, information theory, simulation-based optimization, multi-agent systems, swarm intelligence, and statistics. In the operations research and control literature, reinforcement learning is called approximate dynamic programming, or neuro-dynamic programming. The problems of interest in reinforcement learning have also been studied in the theory of optimal control, which is concerned mostly with the existence and characterization of optimal solutions, and algorithms for their exact computation, and less with learning or approximation, particularly in the absence of a mathematical model of the environment. In economics and game theory, reinforcement learning may be used to explain how equilibrium may arise under bounded rationality.

Basic reinforcement learning is modeled as a Markov decision process (MDP). In mathematics, a Markov decision process is a discrete-time stochastic control process. It provides a mathematical framework for modeling decision making in situations where outcomes are partly random and partly under the control of a decision maker. MDPs are useful for studying optimization problems solved via dynamic programming. MDPs were known at least as early as the 1950s. A core body of research on Markov decision processes resulted from Ronald Howard’s 1960 book, Dynamic Programming and Markov Processes. They are used in many disciplines, including robotics, automatic control, economics and manufacturing. MDPs are named after the Russian mathematician Andrey Markov, as they are an extension of Markov chains.

At each time step, the process is in some state S, and the decision maker may choose any action a that is available in state S. The process responds at the next time step by randomly moving into a new state S’, and giving the decision maker a corresponding reward Ra(S,S’).

The probability that the process moves into its new state S’ is influenced by the chosen action a. Specifically, it is given by the state transition function Pa(S,S’). Thus, the next state S’ depends on the current state S and the decision maker’s action a. But given S and a, it is conditionally independent of all previous states and actions. In other words, the state transitions of an MDP satisfy the Markov property.

Markov decision processes are an extension of Markov chains; the difference is the addition of actions (allowing choice) and rewards (giving motivation). Conversely, if only one action exists for each state (e.g. “wait”) and all rewards are the same (e.g. “zero”), a Markov decision process reduces to a Markov chain.
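The MDP components described above — the state transition function Pa(S,S’) and the reward function Ra(S,S’) — can be sketched concretely. The following is a minimal illustration with a hypothetical two-state MDP (the states, actions, probabilities, and rewards are made up for the example):

```python
import random

# A hypothetical two-state, two-action MDP (all numbers are illustrative).
# transitions[S][a] -> list of (S', P_a(S, S')) pairs
transitions = {
    "S0": {"a": [("S0", 0.5), ("S1", 0.5)], "b": [("S1", 1.0)]},
    "S1": {"a": [("S0", 1.0)],              "b": [("S1", 1.0)]},
}
# rewards[(S, a, S')] -> immediate reward R_a(S, S')
rewards = {
    ("S0", "a", "S0"): 0.0, ("S0", "a", "S1"): 1.0,
    ("S0", "b", "S1"): 0.5,
    ("S1", "a", "S0"): -1.0,
    ("S1", "b", "S1"): 0.0,
}

def step(state, action):
    """Sample S' from P_a(S, .) and return (S', R_a(S, S'))."""
    next_states, probs = zip(*transitions[state][action])
    next_state = random.choices(next_states, weights=probs)[0]
    return next_state, rewards[(state, action, next_state)]

next_state, reward = step("S0", "a")
```

Note that the next state depends only on the current state and action, never on earlier history — this is exactly the Markov property stated above. Dropping action "a" and setting all rewards to zero would reduce this structure to a plain Markov chain.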

A reinforcement learning agent interacts with its environment in discrete time steps. At each time t, the agent receives the current state S(t) and reward r(t). It then chooses an action a(t) from the set of available actions, which is subsequently sent to the environment. The environment moves to a new state S(t+1) and the reward r(t+1) associated with the transition is determined. The goal of a reinforcement learning agent is to learn a policy which maximizes the expected cumulative reward.
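The interaction loop just described — receive S(t), choose a(t), receive S(t+1) and r(t+1) — can be sketched as follows. This is a minimal sketch with a hypothetical deterministic chain environment and a trivial random (non-learning) policy; the environment dynamics and state numbering are assumptions made for the example:

```python
import random

def environment_step(state, action):
    """Hypothetical dynamics: states 0..3 on a line; reward 1.0 at the goal state 3."""
    next_state = state + (1 if action == "right" else -1)
    next_state = max(0, min(next_state, 3))          # clamp to the valid state set
    reward = 1.0 if next_state == 3 else 0.0
    return next_state, reward

def policy(state):
    """A trivial non-learning policy: pick an action uniformly at random."""
    return random.choice(["left", "right"])

state = 0
total_reward = 0.0
for t in range(10):                                   # discrete time steps
    action = policy(state)                            # agent chooses a(t) given S(t)
    state, reward = environment_step(state, action)   # environment yields S(t+1), r(t+1)
    total_reward += reward                            # cumulative reward to maximize
```

A learning agent would replace the random policy with one updated from experience so as to maximize the expected cumulative reward.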

Formulating the problem as an MDP assumes the agent directly observes the current environmental state. In this case the problem is said to have full observability. If the agent only has access to a subset of states, or if the observed states are corrupted by noise, the agent is said to have partial observability, and formally the problem must be formulated as a partially observable Markov decision process (POMDP). In both cases, the set of actions available to the agent can be restricted. For example, the state of an account balance could be restricted to be positive; if the current value of the state is 3 and the state transition attempts to reduce the value by 4, the transition will not be allowed.

When the agent’s performance is compared to that of an agent that acts optimally, the difference in performance gives rise to the notion of regret. In order to act near optimally, the agent must reason about the long-term consequences of its actions (i.e., maximize future income), although the immediate reward associated with this might be negative.

Thus, reinforcement learning is particularly well-suited to problems that include a long-term versus short-term reward trade-off. It has been applied successfully to various problems, including robot control, elevator scheduling, telecommunications, backgammon, checkers and Go (AlphaGo).

Two elements make reinforcement learning powerful: the use of samples to optimize performance and the use of function approximation to deal with large environments. Thanks to these two key components, reinforcement learning can be used in large environments in the following situations:

- A model of the environment is known, but an analytic solution is not available.
- Only a simulation model of the environment is given (the subject of simulation-based optimization).
- The only way to collect information about the environment is to interact with it.

The first two of these problems could be considered planning problems (since some form of model is available), while the last one could be considered to be a genuine learning problem. However, reinforcement learning converts both planning problems to machine learning problems.

The exploration vs. exploitation trade-off has been most thoroughly studied through the multi-armed bandit problem and for finite state space MDPs in Burnetas and Katehakis (1997).

Reinforcement learning requires clever exploration mechanisms; randomly selecting actions, without reference to an estimated probability distribution, shows poor performance. The case of (small) finite Markov decision processes is relatively well understood. However, due to the lack of algorithms that scale well with the number of states (or to problems with infinite state spaces), simple exploration methods remain the most practical.
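One of the simplest such exploration methods is ε-greedy action selection, sketched below on a K-armed bandit (the problem mentioned above). The arm means, noise model, and parameter values are illustrative assumptions; with probability ε the agent explores a random arm, otherwise it exploits its current value estimates:

```python
import random

def epsilon_greedy_bandit(true_means, epsilon=0.1, steps=5000, seed=0):
    """Sample-average epsilon-greedy on a K-armed bandit with Gaussian rewards."""
    rng = random.Random(seed)
    k = len(true_means)
    estimates = [0.0] * k        # Q-value estimate per arm
    counts = [0] * k             # number of pulls per arm
    for _ in range(steps):
        if rng.random() < epsilon:                           # explore
            arm = rng.randrange(k)
        else:                                                # exploit
            arm = max(range(k), key=lambda a: estimates[a])
        reward = rng.gauss(true_means[arm], 1.0)             # noisy reward
        counts[arm] += 1
        estimates[arm] += (reward - estimates[arm]) / counts[arm]  # incremental mean
    return estimates, counts

estimates, counts = epsilon_greedy_bandit([0.1, 0.5, 1.0])
```

After enough steps the estimate for each arm approaches its true mean, and the best arm (here the one with mean 1.0) receives the overwhelming majority of pulls — a concrete instance of balancing exploration against exploitation.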

Even if the issue of exploration is disregarded, and even if the state is fully observable, the problem remains of using past experience to find out which actions lead to higher cumulative rewards.

To acquaint yourself in detail with the certification curriculum, you can expand and analyze the table below.

The EITC/AI/ARL Advanced Reinforced Learning Certification Curriculum references open-access didactic materials in video form. The learning process is divided into a step-by-step structure (programmes -> lessons -> topics) covering relevant curriculum parts. Unlimited consultancy with domain experts is also provided.

For details on the Certification procedure, check How it Works.

### Curriculum Reference Resources

“Human-level control through deep reinforcement learning” publication

https://deepmind.com/research/publications/human-level-control-through-deep-reinforcement-learning

Open-access course on deep reinforcement learning at UC Berkeley

http://rail.eecs.berkeley.edu/deeprlcourse/

RL applied to the K-armed bandit problem, from Manifold.ai

https://www.manifold.ai/exploration-vs-exploitation-in-reinforcement-learning