MDP

  • State: describes the agent’s status with respect to the environment.
  • State space: the set of all states, denoted as $S = \{s_1, \ldots, s_n\}$.
  • Action: what the agent can execute at a state; taking an action moves the agent from one state to another.
  • Policy: tells the agent what action to take in a given state.
  • Reward: After executing an action at a state, the agent obtains a reward, denoted as r, as feedback from the environment. The reward is a function of the state s and action a. Hence, it is also denoted as r(s, a).
  • Trajectory: state-action-reward chain.
  • Return of the trajectory: the sum of all rewards received along the trajectory.
  • Discounted return with discount rate $\gamma$: $\sum_{t=0}^{\infty} \gamma^t r(s_t, a_t)$ (see the sketch after this list).
  • Episode: When interacting with the environment by following a policy, the agent may stop at some terminal state. The resulting trajectory is called an episode (or a trial).
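
As a quick illustration of the discounted return, here is a minimal Python sketch that sums $\gamma^t r(s_t, a_t)$ over a finite trajectory. The function name, reward sequence, and $\gamma$ value are made-up assumptions for illustration.

```python
# Minimal sketch: discounted return of a finite trajectory.
# The reward list and gamma below are illustrative assumptions.

def discounted_return(rewards, gamma=0.9):
    """Compute sum_{t=0}^{T-1} gamma^t * r_t for a finite reward sequence."""
    g = 0.0
    for t, r in enumerate(rewards):
        g += (gamma ** t) * r
    return g

# Example: a trajectory that collects rewards 0, 0, 0, 1 before terminating.
print(discounted_return([0, 0, 0, 1], gamma=0.9))  # 0.9**3 = 0.729
```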

Markov Decision Process

  • Sets:

    • State space: the set of all states, $S$.
    • Action space: a set of actions, $A(s)$, associated with each state $s \in S$.
    • Reward set: a set of rewards, denoted as $R(s, a)$, associated with each state-action pair $(s, a)$.
  • Model (see the tabular sketch after this list):

    • State transition probability: In state $s$, when taking action $a$, the probability of transitioning to state $s'$ is $p(s'|s, a)$. For any $(s, a)$,
    \[\sum_{s' \in S} p(s'|s, a) = 1\]
    • Reward probability: In state $s$, when taking action $a$, the probability of obtaining reward $r$ is $p(r|s, a)$. For any $(s, a)$,
    \[\sum_{r \in R(s,a)} p(r|s, a) = 1\]
  • Policy: In state $s$, the probability of choosing action $a$ is $\pi(a|s)$. For any $s \in S$,

    \[\sum_{a \in A(s)} \pi(a|s) = 1\]
  • Markov property: the memoryless property of a stochastic process; the next state and reward depend only on the current state and action, not on the earlier history.

    \[p(s_{t+1}|s_t, a_t, s_{t-1}, a_{t-1}, \ldots, s_0, a_0) = p(s_{t+1}|s_t, a_t)\] \[p(r_{t+1}|s_t, a_t, s_{t-1}, a_{t-1}, \ldots, s_0, a_0) = p(r_{t+1}|s_t, a_t)\]
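
To make the model and policy concrete, here is a minimal tabular sketch in Python. The state and action names, the probability tables, and the policy are illustrative assumptions, not part of the text; each row of each table sums to 1, mirroring the constraints above.

```python
import random

# p(s'|s, a): for each (s, a), a dict mapping next states to probabilities (sums to 1).
P = {
    ("s1", "a1"): {"s1": 0.2, "s2": 0.8},
    ("s1", "a2"): {"s1": 1.0},
    ("s2", "a1"): {"s2": 1.0},
}

# p(r|s, a): for each (s, a), a dict mapping rewards to probabilities (sums to 1).
R = {
    ("s1", "a1"): {0.0: 0.9, 1.0: 0.1},
    ("s1", "a2"): {0.0: 1.0},
    ("s2", "a1"): {1.0: 1.0},
}

# pi(a|s): for each state, a dict mapping actions to probabilities (sums to 1).
PI = {
    "s1": {"a1": 0.7, "a2": 0.3},
    "s2": {"a1": 1.0},
}

def sample(dist):
    """Draw one outcome from a {value: probability} dict."""
    values, probs = zip(*dist.items())
    return random.choices(values, weights=probs, k=1)[0]

def step(s):
    """One interaction: choose a ~ pi(.|s), then sample r ~ p(.|s,a) and s' ~ p(.|s,a)."""
    a = sample(PI[s])
    r = sample(R[(s, a)])
    s_next = sample(P[(s, a)])
    return a, r, s_next

print(step("s1"))  # e.g. ('a1', 0.0, 's2')
```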

State value

  • Discounted return along the trajectory: $G_t = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$, $\gamma \in [0, 1)$ is the discount rate.
  • State value (see the Monte Carlo sketch below):

    \[\begin{aligned} v_{\pi}(s) &= \mathbb{E}[G_t | S_t = s] \\\\ &= \mathbb{E}[R_{t+1} + \gamma G_{t+1} | S_t = s] \\\\ &= \mathbb{E}[R_{t+1}|S_t=s] + \gamma \mathbb{E}[G_{t+1}|S_t=s] \end{aligned}\]
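
Since $v_{\pi}(s)$ is an expectation of the return, it can be approximated by averaging sampled returns starting from $s$. The sketch below does exactly that on a made-up two-state MDP; the states, rewards, policy, and truncation horizon are assumptions for illustration only.

```python
import random

# Minimal sketch: Monte Carlo estimate of v_pi(s) = E[G_t | S_t = s],
# obtained by averaging truncated discounted returns over many rollouts.

GAMMA = 0.9
NEXT = {("s1", "move"): "s2", ("s2", "stay"): "s2"}   # deterministic transitions
PI = {"s1": "move", "s2": "stay"}                     # deterministic policy

def reward(s, a):
    """r(s, a): zero at s1; at s2 the reward is 0 or 2 with equal probability (mean 1)."""
    if s == "s1":
        return 0.0
    return random.choice([0.0, 2.0])

def sample_return(s, horizon=200):
    """Discounted return G_t of one rollout from s, truncated at `horizon` steps."""
    g, discount = 0.0, 1.0
    for _ in range(horizon):
        a = PI[s]
        g += discount * reward(s, a)
        discount *= GAMMA
        s = NEXT[(s, a)]
    return g

def v_hat(s, episodes=2000):
    """Average of sampled returns; approaches v_pi(s) as episodes grows."""
    return sum(sample_return(s) for _ in range(episodes)) / episodes

# Expected values: v_pi(s2) = 1/(1 - 0.9) = 10, v_pi(s1) = 0.9 * 10 = 9.
print(round(v_hat("s1"), 2), round(v_hat("s2"), 2))  # roughly 9.0 and 10.0
```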


