Reinforcment learning Udacity

Hi!

I'm going to use this blog as my personal notes while I'm studying for the Reinforcement learning course by Udacity. I have go through the first two lessons and I'm going to list the main ideas studied.

There are three main topics in Machine learning 1) Supervised Learning (SL) 2) Unsupervised Learning (UL) and 3) Reinforcement Learning (RL). In SL we have a pair of <y, x> where y is the variable we are interesting (to predict or to classifie) and x are features. The objective is to find a function f(x) that can recreate our predicted variable. In UL we don't have a variable of interest, we have a bounch of features X and we want to find patterns or relationships in the data.

RL is more about learning from interaction with the environment. We interact with the environment using actions and we receive from the environment a reward. That is how we communicate with the environment Actions -> Reward -> Actions -> Reward.... The model introduce stochasticity by allowing the environment to change with some probability function. The action and the state of the enviroment determines our reward.

The objective of RL is to fin a function that maps states of the environment and actions. For every possible state we need to find the action that maximize our expected return. This function is call a Policy and is define as:

π^* = arg max _a E{U(s₁, s₂,...)|π}

In the above equation we introduce the concept of utility U(s₁, s₂,...). The idea is that we are constantly interacting with the environment and each period we receive a reward, the utility function can be view as the discounted sum of rewards during a infinite horizon:

U(s₁, s₂,...) = Σ_t=0 γ^tR(s_t)

where 0 <= γ <1

Lets re-write the first equation in terms of rewards

π^* = arg max _a E{ Σ_t=0 γ^tR(s_t)|π}

Know we define V(s) wich is a function that gives the "value" of any state given that we start in some state s₀ and follow the optimal policy:

V(s) = E{ Σ_t=0 γ^tR(s_t)|π^* s₀ = s}

If we know V(s) for every state, then the optimal policy would be to peak the action that returns the highest value. But we don' know V(s) in fact V(s) is defined in terms of the optimal policy, but we can define π^* in the following circular manner:

π^* = arg max _a Σ _s' Γ (s, a, s') V(s')

Where s' are the possible states that can take the environment in s_t+1 and Γ (s, a, s') is the transition probabilty Pr{s' | s, a}.

Bellman equation

Finally we can derive the most important equation, the bellman equation.

V(s) = R(s) + arg max _a γΣ _s' Γ (s, a, s') V(s')

The first term is just the immediate reward that gives the state s, R(s). The second term can be view as the discounted expected return value of the next state. This second term is a function of V(s'), so obviously:

V(s') = R(s') + arg max _a γΣ _s' Γ (s', a, s'') V(s'')

What we are doing is discounted each of the values we expect to receive in the future.

Nex time

Ok this is it for now. Any questions, error checking are appreciated. In the next post I introduce an small markov process to show in code how this equations look. Also I'm going the way to find the optimal policy (since all i did in this post was define the main aspects)

Thanks for readding.