From 1fe418518f63655e1845606e108c09cbae287988 Mon Sep 17 00:00:00 2001
From: absolutelyNoWarranty
Date: Sun, 30 Oct 2016 11:12:33 +0800
Subject: [PATCH] Fix simple typos in README's

---
 DP/README.md             |  4 ++--
 DQN/README.md            |  6 +++---
 FA/README.md             |  4 ++--
 Introduction/README.md   |  2 +-
 MDP/README.md            | 10 +++++-----
 PolicyGradient/README.md |  2 +-
 6 files changed, 14 insertions(+), 14 deletions(-)

diff --git a/DP/README.md b/DP/README.md
index a5460004d..234d736f5 100644
--- a/DP/README.md
+++ b/DP/README.md
@@ -11,7 +11,7 @@
 ### Summary
 
 - Dynamic Programming (DP) methods assume that we have a perfect model of the environment's Markov Decision Process (MDP). That's usually not the case in practice, but it's important to study DP anyway.
-- Policy Evaluation: Calculates the state-value function V(s) for a given policy. In DP this is done using a "full backup". At each state we look ahead one step at each possible action and next state. We can only do this because we have a perfect model of the environment.
+- Policy Evaluation: Calculates the state-value function `V(s)` for a given policy. In DP this is done using a "full backup". At each state we look ahead one step at each possible action and next state. We can only do this because we have a perfect model of the environment.
 - Full backups are basically the Bellman equations turned into updates.
 - Policy Improvement: Given the correct state-value function for a policy we can act greedily with respect to it (i.e. pick the best action at each state). Then we are guaranteed to improve the policy or keep it fixed if it's already optimal.
 - Policy Iteration: Iteratively perform Policy Evaluation and Policy Improvement until we reach the optimal policy.
@@ -43,4 +43,4 @@
 - Implement Value Iteration in Python (Gridworld)
   - [Exercise](Value Iteration.ipynb)
-  - [Solution](Value Iteration Solution.ipynb)
\ No newline at end of file
+  - [Solution](Value Iteration Solution.ipynb)
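To make the "full backup" in the DP summary above concrete, here is a minimal Policy Evaluation sketch. It is not the Gridworld notebook code from this repo; it assumes a tiny MDP whose dynamics are given explicitly as `P[s][a] = [(prob, next_state, reward, done), ...]`, and the names (`policy_eval`, `P`) are illustrative.

```python
import numpy as np

def policy_eval(policy, P, num_states, num_actions, gamma=1.0, theta=1e-8):
    """Iterative policy evaluation with full backups.

    policy: (num_states, num_actions) array of action probabilities pi(a|s).
    P:      perfect model of the environment; P[s][a] is a list of
            (prob, next_state, reward, done) tuples.
    """
    V = np.zeros(num_states)
    while True:
        delta = 0.0
        for s in range(num_states):
            v_new = 0.0
            # Full backup: look ahead one step over every action and successor state.
            for a in range(num_actions):
                for prob, s_next, reward, done in P[s][a]:
                    v_new += policy[s, a] * prob * (reward + gamma * V[s_next] * (not done))
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < theta:  # stop once the value function has (approximately) converged
            return V

# Tiny 2-state example: action 0 stays put (reward 0), action 1 switches state (reward 1).
P = {
    0: {0: [(1.0, 0, 0.0, False)], 1: [(1.0, 1, 1.0, False)]},
    1: {0: [(1.0, 1, 0.0, False)], 1: [(1.0, 0, 1.0, False)]},
}
uniform_policy = np.full((2, 2), 0.5)
print(policy_eval(uniform_policy, P, num_states=2, num_actions=2, gamma=0.9))  # ~[5. 5.]
```

Policy Iteration would alternate this evaluation step with greedy improvement; Value Iteration folds the two together by backing up with a max over actions instead of the policy-weighted sum.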
diff --git a/DQN/README.md b/DQN/README.md
index 2568b6a49..b07b3b3ec 100644
--- a/DQN/README.md
+++ b/DQN/README.md
@@ -11,10 +11,10 @@
 ### Summary
 
 - DQN: Q-Learning but with a Deep Neural Network as a function approximator.
-- Using a nolinear Deep Neural Network is powerful, but training is unstable if we apply it naively.
+- Using a non-linear Deep Neural Network is powerful, but training is unstable if we apply it naively.
 - Trick 1 - Experience Replay: Store experience `(S, A, R, S_next)` in a replay buffer and sample minibatches from it to train the network. This decorrelates the data and leads to better data efficiency. In the beginning the replay buffer is filled with random experience.
 - Trick 2 - Target Network: Use a separate network to estimate the TD target. This target network has the same architecture as the function approximator but with frozen parameters. Every T steps (a hyperparameter) the parameters from the Q network are copied to the target network. This leads to more stable training because it keeps the target function fixed (for a while).
-- By using a Convolutional Neural Network as function approximator on raw pixels of Atari games where the score is the reward we can learn to play many those games at human-like performance.
+- By using a Convolutional Neural Network as the function approximator on raw pixels of Atari games where the score is the reward we can learn to play many of those games at human-like performance.
 - Double DQN: Just like regular Q-Learning, DQN tends to overestimate values due to its max operation applied to both selecting and estimating actions. We get around this by using the Q network for selection and the target network for estimation when making updates.
@@ -46,4 +46,4 @@
 - Double-Q Learning
   - This is a minimal change to Q-Learning so use the same exercise as above
   - [Solution](Double DQN Solution.ipynb)
-- Prioritized Experience Replay (WIP)
\ No newline at end of file
+- Prioritized Experience Replay (WIP)
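The two DQN tricks and the Double DQN target from the summary above fit together roughly as sketched below. This is a framework-free illustration with made-up names (`ReplayBuffer`, `LinearQ`, `double_dqn_targets`) and a linear stand-in for the deep network, not the TensorFlow code in the DQN notebooks.

```python
import random
from collections import deque

import numpy as np

class ReplayBuffer:
    """Trick 1 - Experience Replay: store transitions, sample decorrelated minibatches."""

    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)

    def add(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        batch = random.sample(self.buffer, batch_size)
        states, actions, rewards, next_states, dones = map(np.array, zip(*batch))
        return states, actions, rewards, next_states, dones

class LinearQ:
    """Stand-in for the deep network: Q(s, .) = s @ W."""

    def __init__(self, state_dim, num_actions):
        self.W = np.zeros((state_dim, num_actions))

    def predict(self, states):
        return states @ self.W

    def copy_from(self, other):
        # Trick 2 - Target Network: every T steps, freeze a copy of the Q network here.
        self.W = other.W.copy()

def double_dqn_targets(q_net, target_net, rewards, next_states, dones, gamma=0.99):
    """Double DQN: the Q network *selects* the next action, the target network *evaluates* it."""
    best_actions = np.argmax(q_net.predict(next_states), axis=1)
    next_q = target_net.predict(next_states)[np.arange(len(best_actions)), best_actions]
    return rewards + gamma * next_q * (1.0 - dones)
```

A plain DQN target would instead take `np.max(target_net.predict(next_states), axis=1)`, using the target network for both selection and evaluation, which is what causes the overestimation Double DQN avoids.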
diff --git a/FA/README.md b/FA/README.md
index bae7f4fa6..befd68de0 100644
--- a/FA/README.md
+++ b/FA/README.md
@@ -11,12 +11,12 @@
 ### Summary
 
 - Building a big table, one value for each state or state-action pair, is memory- and data-inefficient. Function Approximation can generalize to unseen states by using a featurized state representation.
-- Treat RL as supervised learning problem with the MC- or TD-target as the label and the current state/action as the input. Often the target also depends on the function estimator buy we simply ignore its gradient. That's why these methods are called semi-gradient methods.
+- Treat RL as a supervised learning problem with the MC- or TD-target as the label and the current state/action as the input. Often the target also depends on the function estimator but we simply ignore its gradient. That's why these methods are called semi-gradient methods.
 - Challenge: We have non-stationary (policy changes, bootstrapping) and non-iid (correlated in time) data.
 - Many methods assume that our action space is discrete because they rely on calculating the argmax over all actions. Large and continuous action spaces are ongoing research.
 - For Control very few convergence guarantees exist. For non-linear approximators there are basically no guarantees at all. But in works in practice.
 - Experience Replay: Store experience as dataset, randomize it, and repeatedly apply minibatch SGD.
-- Tricks to stabilize nonlinear function approximators: Fixed Targets. The target is calculated based on frozen parameter values from a previous time step.
+- Tricks to stabilize non-linear function approximators: Fixed Targets. The target is calculated based on frozen parameter values from a previous time step.
 - For the non-episodic (continuing) case function approximation is more complex and we need to give up discounting and use an "average reward" formulation.
diff --git a/Introduction/README.md b/Introduction/README.md
index 1eb1ffe0f..c52704709 100644
--- a/Introduction/README.md
+++ b/Introduction/README.md
@@ -7,7 +7,7 @@
 
 ### Summary
 
-- Reinforcement Learning (RL)is concered with goal-directed learning and decison-making.
+- Reinforcement Learning (RL) is concerned with goal-directed learning and decision-making.
 - In RL an agent learns from experiences it gains by interacting with the environment. In Supervised Learning we cannot affect the environment.
 - In RL rewards are often delayed in time and the agent tries to maximize a long-term goal. For example, one may need to make seemingly suboptimal moves to reach a winning position in a game.
 - An agents interacts with the environment via states, actions and rewards.
diff --git a/MDP/README.md b/MDP/README.md
index b94c1a9c4..f92b2a070 100644
--- a/MDP/README.md
+++ b/MDP/README.md
@@ -10,15 +10,15 @@
 
 ### Summary
 
-- Agent & Environment Interface: At each step t the action receives a state `S_t`, performs an action `A_t` and receives a reward `R_{t+1}`. The action is chosen according to a policy function `pi`.
+- Agent & Environment Interface: At each step `t` the agent receives a state `S_t`, performs an action `A_t` and receives a reward `R_{t+1}`. The action is chosen according to a policy function `pi`.
 - The total return `G_t` is the sum of all rewards starting from time t . Future rewards are discounted at a discount rate `gamma^k`.
-- Markov property: The environment's response at time `t+1` depends only on the state and action representations at time `t`. The future is independent of the past given the present. Even if an environment doesn't fully satisfy the Markov property we still treat it as if it did and try to construct the state representation to be approximately Markov.
+- Markov property: The environment's response at time `t+1` depends only on the state and action representations at time `t`. The future is independent of the past given the present. Even if an environment doesn't fully satisfy the Markov property we still treat it as if it is and try to construct the state representation to be approximately Markov.
 - Markov Decision Process (MDP): Defined by a state set S, action set A and one-step dynamics `p(s',r | s,a)`. If we have complete knowledge of the environment we know the transition dynamic. In practice we often don't know the full MDP (but we know that it's some MDP).
-- The Value Function `v(s)` estimates how "good" it is for an agent to be in a particular sate. More formally, it's the expected return `G_t` given that the agent is in state s. `vs) = Ex[G_t | S_t = s]`. Note that the value function is specific to a given policy `pi`.
+- The Value Function `v(s)` estimates how "good" it is for an agent to be in a particular state. More formally, it's the expected return `G_t` given that the agent is in state `s`. `v(s) = Ex[G_t | S_t = s]`. Note that the value function is specific to a given policy `pi`.
 - Action Value function: q(s, a) estimates how "good" it is for an agent to be in state s and take action a. Similar to the value function, but also considers the action.
 - The Bellman equation expresses the relationship between the value of a state and the values of its successor states. It can be expressed using a "backup" diagram. Bellman equations exist for both the value function and the action value function.
 - Value functions define an ordering over policies. A policy `p1` is better than `p2` if `v_p1(s) >= v_p2(s)` for all states s. For MDPs there exist one or more optimal policies that are better than or equal to all other policies.
-- The optimal state value function `v*(s)` is the value function for the optimal policy. Same for `q*(s, a)`. The Bellman Optimality Equation defines how the optimal value of a state related to the optimal value of successor states. It has a "max" instead of an average.
+- The optimal state value function `v*(s)` is the value function for the optimal policy. Same for `q*(s, a)`. The Bellman Optimality Equation defines how the optimal value of a state is related to the optimal value of successor states. It has a "max" instead of an average.
 
 ### Lectures & Readings
@@ -31,4 +31,4 @@
 
 ### Exercises
 
-This chapter is mostly theory so there are no exercises.
\ No newline at end of file
+This chapter is mostly theory so there are no exercises.
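A small sketch to make the return `G_t` from the MDP summary above concrete (an illustration only; `discounted_return` is a made-up helper, not code from the notebooks). The backwards recursion it uses, `G_t = R_{t+1} + gamma * G_{t+1}`, is the same structure the Bellman equations turn into backups.

```python
def discounted_return(rewards, gamma=0.99):
    """G_t = R_{t+1} + gamma * R_{t+2} + gamma^2 * R_{t+3} + ...

    Computed backwards via the recursion G_t = R_{t+1} + gamma * G_{t+1}.
    """
    G = 0.0
    for r in reversed(rewards):
        G = r + gamma * G
    return G

# Three rewards of 1.0 with gamma = 0.5: 1 + 0.5 + 0.25 = 1.75
print(discounted_return([1.0, 1.0, 1.0], gamma=0.5))
```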
diff --git a/PolicyGradient/README.md b/PolicyGradient/README.md
index 625688ec4..e4e20a063 100644
--- a/PolicyGradient/README.md
+++ b/PolicyGradient/README.md
@@ -14,7 +14,7 @@
 ### Summary
 
 - Idea: Instead of parameterizing the value function and doing greedy policy improvement we parameterize the policy and do gradient descent into a direction that improves it.
-- Sometimes the policy is easier to approximate than the value function. Also, we need a parameterized policy to deal with continuous action spaces and environment where we need to act stochastically.
+- Sometimes the policy is easier to approximate than the value function. Also, we need a parameterized policy to deal with continuous action spaces and environments where we need to act stochastically.
 - Policy Score Function `J(theta)`: Intuitively, it measures how good our policy is. For example, we can use the average value or average reward under a policy as our objective.
 - Common choices for the policy function: Softmax for discrete actions, Gaussian parameters for continuous actions.
 - Policy Gradient Theorem: `grad(J(theta)) = Ex[grad(log(pi(s, a))) * Q(s, a)]`. Basically, we move our policy into a direction of more reward.
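The policy gradient theorem quoted above translates into an update rule almost verbatim. Below is a minimal sketch for the softmax-over-discrete-actions case with linear preferences: the names (`SoftmaxPolicy`, `reinforce_update`) are illustrative rather than the notebook code in this folder, and in practice `q_estimate` would be a sampled return `G_t` (REINFORCE) or a critic's estimate of `Q(s, a)`.

```python
import numpy as np

def softmax(x):
    z = x - np.max(x)  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

class SoftmaxPolicy:
    """Softmax policy over discrete actions with linear preferences h(s, a) = state @ theta[:, a]."""

    def __init__(self, state_dim, num_actions):
        self.theta = np.zeros((state_dim, num_actions))

    def action_probs(self, state):
        return softmax(state @ self.theta)

    def grad_log_pi(self, state, action):
        # For linear softmax: grad of log pi(a|s) w.r.t. theta[:, b] is state * (1{b == a} - pi(b|s)).
        probs = self.action_probs(state)
        grad = -np.outer(state, probs)
        grad[:, action] += state
        return grad

    def reinforce_update(self, state, action, q_estimate, lr=0.01):
        # Stochastic ascent on J(theta): grad(J(theta)) = Ex[grad(log(pi(s, a))) * Q(s, a)].
        self.theta += lr * q_estimate * self.grad_log_pi(state, action)

# Usage sketch: nudge the policy toward action 1 after seeing an estimated value of +1.
policy = SoftmaxPolicy(state_dim=3, num_actions=2)
s = np.array([1.0, 0.0, 2.0])
policy.reinforce_update(s, action=1, q_estimate=1.0)
print(policy.action_probs(s))  # probability of action 1 has moved above 0.5
```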