MDP

  • State: describes the agent’s status with respect to the environment.
  • State space: the set of all states, denoted as $S = \{s_1, \ldots, s_n\}$.
  • Action: what the agent can execute at a state; taking an action moves the agent from one state to another.
  • Policy: tells the agent what action to take in a given state.
  • Reward: After executing an action at a state, the agent obtains a reward, denoted as r, as feedback from the environment. The reward is a function of the state s and action a. Hence, it is also denoted as r(s, a).
  • Trajectory: state-action-reward chain.
  • Return of the trajectory: the sum of all rewards received along the trajectory.
  • Discounted return with discount rate $\gamma$: $\sum_{t=0}^{\infty} \gamma^t r(s_t, a_t)$ (see the sketch after this list).
  • Episode: When interacting with the environment by following a policy, the agent may stop at some terminal state. The resulting trajectory is called an episode (or a trial).
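
As a quick illustration of the discounted return, here is a minimal Python sketch that sums $\gamma^t r(s_t, a_t)$ over a finite trajectory. The function name, reward sequence, and $\gamma$ value are made-up assumptions for illustration.

```python
# Minimal sketch: discounted return of a finite trajectory.
# The reward list and gamma below are illustrative assumptions.

def discounted_return(rewards, gamma=0.9):
    """Compute sum_{t=0}^{T-1} gamma^t * r_t for a finite reward sequence."""
    g = 0.0
    for t, r in enumerate(rewards):
        g += (gamma ** t) * r
    return g

# Example: a trajectory that collects rewards 0, 0, 0, 1 before terminating.
print(discounted_return([0, 0, 0, 1], gamma=0.9))  # 0.9**3 = 0.729
```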

Markov Decision Process

  • Sets:

    • State space: the set of all states, $S$.
    • Action space: a set of actions, $A(s)$, associated with each state $s \in S$.
    • Reward set: a set of rewards, denoted as $R(s, a)$, associated with each state-action pair $(s, a)$.
  • Model (see the tabular sketch after this list):

    • State transition probability: In state $s$, when taking action $a$, the probability of transitioning to state $s'$ is $p(s'|s, a)$. For any $(s, a)$,
    \[\sum_{s' \in S} p(s'|s, a) = 1\]
    • Reward probability: In state $s$, when taking action $a$, the probability of obtaining reward $r$ is $p(r|s, a)$. For any $(s, a)$,
    \[\sum_{r \in R(s,a)} p(r|s, a) = 1\]
  • Policy: In state $s$, the probability of choosing action $a$ is $\pi(a|s)$. For any $s \in S$,

    \[\sum_{a \in A(s)} \pi(a|s) = 1\]
  • Markov property: the memoryless property of a stochastic process; the next state and reward depend only on the current state and action, not on the earlier history.

    \[p(s_{t+1}|s_t, a_t, s_{t-1}, a_{t-1}, \ldots, s_0, a_0) = p(s_{t+1}|s_t, a_t)\] \[p(r_{t+1}|s_t, a_t, s_{t-1}, a_{t-1}, \ldots, s_0, a_0) = p(r_{t+1}|s_t, a_t)\]
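
To make the model and policy concrete, here is a minimal tabular sketch in Python. The state and action names, the probability tables, and the policy are illustrative assumptions, not part of the text; each row of each table sums to 1, mirroring the constraints above.

```python
import random

# p(s'|s, a): for each (s, a), a dict mapping next states to probabilities (sums to 1).
P = {
    ("s1", "a1"): {"s1": 0.2, "s2": 0.8},
    ("s1", "a2"): {"s1": 1.0},
    ("s2", "a1"): {"s2": 1.0},
}

# p(r|s, a): for each (s, a), a dict mapping rewards to probabilities (sums to 1).
R = {
    ("s1", "a1"): {0.0: 0.9, 1.0: 0.1},
    ("s1", "a2"): {0.0: 1.0},
    ("s2", "a1"): {1.0: 1.0},
}

# pi(a|s): for each state, a dict mapping actions to probabilities (sums to 1).
PI = {
    "s1": {"a1": 0.7, "a2": 0.3},
    "s2": {"a1": 1.0},
}

def sample(dist):
    """Draw one outcome from a {value: probability} dict."""
    values, probs = zip(*dist.items())
    return random.choices(values, weights=probs, k=1)[0]

def step(s):
    """One interaction: choose a ~ pi(.|s), then sample r ~ p(.|s,a) and s' ~ p(.|s,a)."""
    a = sample(PI[s])
    r = sample(R[(s, a)])
    s_next = sample(P[(s, a)])
    return a, r, s_next

print(step("s1"))  # e.g. ('a1', 0.0, 's2')
```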

State value

  • Discounted return along the trajectory: $G_t = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$, $\gamma \in [0, 1)$ is the discount rate.
  • State value (see the Monte Carlo sketch below):

    \[\begin{aligned} v_{\pi}(s) &= \mathbb{E}[G_t | S_t = s] \\\\ &= \mathbb{E}[R_{t+1} + \gamma G_{t+1} | S_t = s] \\\\ &= \mathbb{E}[R_{t+1}|S_t=s] + \gamma \mathbb{E}[G_{t+1}|S_t=s] \end{aligned}\]
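
Since $v_{\pi}(s)$ is an expectation of the return, it can be approximated by averaging sampled returns starting from $s$. The sketch below does exactly that on a made-up two-state MDP; the states, rewards, policy, and truncation horizon are assumptions for illustration only.

```python
import random

# Minimal sketch: Monte Carlo estimate of v_pi(s) = E[G_t | S_t = s],
# obtained by averaging truncated discounted returns over many rollouts.

GAMMA = 0.9
NEXT = {("s1", "move"): "s2", ("s2", "stay"): "s2"}   # deterministic transitions
PI = {"s1": "move", "s2": "stay"}                     # deterministic policy

def reward(s, a):
    """r(s, a): zero at s1; at s2 the reward is 0 or 2 with equal probability (mean 1)."""
    if s == "s1":
        return 0.0
    return random.choice([0.0, 2.0])

def sample_return(s, horizon=200):
    """Discounted return G_t of one rollout from s, truncated at `horizon` steps."""
    g, discount = 0.0, 1.0
    for _ in range(horizon):
        a = PI[s]
        g += discount * reward(s, a)
        discount *= GAMMA
        s = NEXT[(s, a)]
    return g

def v_hat(s, episodes=2000):
    """Average of sampled returns; approaches v_pi(s) as episodes grows."""
    return sum(sample_return(s) for _ in range(episodes)) / episodes

# Expected values: v_pi(s2) = 1/(1 - 0.9) = 10, v_pi(s1) = 0.9 * 10 = 9.
print(round(v_hat("s1"), 2), round(v_hat("s2"), 2))  # roughly 9.0 and 10.0
```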


