Merge pull request dennybritz#17 from absolutelyNoWarranty/fix-typos
Fix simple typos in READMEs
dennybritz authored Oct 31, 2016
2 parents ccc39c9 + 1fe4185 commit 23b6930
Showing 6 changed files with 14 additions and 14 deletions.
4 changes: 2 additions & 2 deletions DP/README.md
### Summary

- Dynamic Programming (DP) methods assume that we have a perfect model of the environment's Markov Decision Process (MDP). That's usually not the case in practice, but it's important to study DP anyway.
- Policy Evaluation: Calculates the state-value function `V(s)` for a given policy. In DP this is done using a "full backup". At each state we look ahead one step at each possible action and next state. We can only do this because we have a perfect model of the environment (see the sketch after this list).
- Full backups are basically the Bellman equations turned into updates.
- Policy Improvement: Given the correct state-value function for a policy we can act greedily with respect to it (i.e. pick the best action at each state). Then we are guaranteed to improve the policy or keep it fixed if it's already optimal.
- Policy Iteration: Iteratively perform Policy Evaluation and Policy Improvement until we reach the optimal policy.
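
A minimal sketch of the iterative Policy Evaluation / full-backup idea described above. It assumes a small tabular MDP exposed as `P[s][a] = [(prob, next_state, reward, done), ...]` (the convention of Gym's discrete environments); the function name and arguments are illustrative, not the repository's exercise code.

```python
import numpy as np

def policy_eval(policy, P, num_states, num_actions, gamma=1.0, theta=1e-8):
    """Iterative Policy Evaluation via the Bellman expectation backup.

    policy: array [num_states, num_actions] of action probabilities pi(a|s).
    P:      P[s][a] = [(prob, next_state, reward, done), ...] -- the full model.
    """
    V = np.zeros(num_states)
    while True:
        delta = 0.0
        for s in range(num_states):
            v = 0.0
            # Full backup: look ahead one step over every action and next state.
            for a in range(num_actions):
                for prob, next_s, reward, done in P[s][a]:
                    v += policy[s][a] * prob * (reward + gamma * V[next_s] * (not done))
            delta = max(delta, abs(v - V[s]))
            V[s] = v
        if delta < theta:  # stop once the largest update is negligible
            break
    return V
```

Policy Iteration then alternates this evaluation step with greedy Policy Improvement until the policy stops changing.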

- Implement Value Iteration in Python (Gridworld)
- [Exercise](Value Iteration.ipynb)
- [Solution](Value Iteration Solution.ipynb)
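
For comparison with the exercise above, a compact sketch of Value Iteration under the same hypothetical `P[s][a]` model format; the notebook solution may differ in details.

```python
import numpy as np

def value_iteration(P, num_states, num_actions, gamma=1.0, theta=1e-8):
    """Bellman optimality backup: V(s) <- max_a sum p(s',r|s,a) * (r + gamma * V(s'))."""
    V = np.zeros(num_states)
    while True:
        delta = 0.0
        for s in range(num_states):
            q = np.zeros(num_actions)
            for a in range(num_actions):
                for prob, next_s, reward, done in P[s][a]:
                    q[a] += prob * (reward + gamma * V[next_s] * (not done))
            delta = max(delta, abs(q.max() - V[s]))
            V[s] = q.max()  # "max" instead of an average over actions
        if delta < theta:
            break
    # Extract the greedy (optimal) deterministic policy from V.
    policy = np.zeros((num_states, num_actions))
    for s in range(num_states):
        q = np.zeros(num_actions)
        for a in range(num_actions):
            for prob, next_s, reward, done in P[s][a]:
                q[a] += prob * (reward + gamma * V[next_s] * (not done))
        policy[s, np.argmax(q)] = 1.0
    return policy, V
```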
6 changes: 3 additions & 3 deletions DQN/README.md
### Summary

- DQN: Q-Learning but with a Deep Neural Network as a function approximator.
- Using a non-linear Deep Neural Network is powerful, but training is unstable if we apply it naively.
- Trick 1 - Experience Replay: Store experience `(S, A, R, S_next)` in a replay buffer and sample minibatches from it to train the network. This decorrelates the data and leads to better data efficiency. In the beginning the replay buffer is filled with random experience.
- Trick 2 - Target Network: Use a separate network to estimate the TD target. This target network has the same architecture as the function approximator but with frozen parameters. Every T steps (a hyperparameter) the parameters from the Q network are copied to the target network. This leads to more stable training because it keeps the target function fixed (for a while).
- By using a Convolutional Neural Network as the function approximator on raw pixels of Atari games where the score is the reward we can learn to play many of those games at human-like performance.
- Double DQN: Just like regular Q-Learning, DQN tends to overestimate values because its max operation both selects and evaluates actions. We get around this by using the Q network for selection and the target network for estimation when making updates (a sketch of this target computation follows the exercises below).
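
A rough sketch of the two tricks (replay buffer and periodic target-network sync). The weight dictionaries are stand-ins for real network parameters, not a particular framework's API.

```python
import random
from collections import deque, namedtuple

Transition = namedtuple("Transition", ["state", "action", "reward", "next_state", "done"])

class ReplayBuffer:
    """Trick 1: store transitions and sample uniformly to decorrelate training data."""

    def __init__(self, capacity=100000):
        self.buffer = deque(maxlen=capacity)

    def add(self, state, action, reward, next_state, done):
        self.buffer.append(Transition(state, action, reward, next_state, done))

    def sample(self, batch_size=32):
        return random.sample(self.buffer, batch_size)

def sync_target(step, T, q_params, target_params):
    """Trick 2: every T steps copy the online Q-network weights into the target network."""
    if step % T == 0:
        for name, value in q_params.items():
            target_params[name] = value.copy()  # assumes numpy-array weights
    return target_params
```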


- Double-Q Learning
- This is a minimal change to Q-Learning so use the same exercise as above
- [Solution](Double DQN Solution.ipynb)
- Prioritized Experience Replay (WIP)
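
A short sketch of how the Double DQN target (from the summary above) differs from the vanilla DQN target, using batches of precomputed Q-values as placeholders for the two networks' outputs.

```python
import numpy as np

def dqn_targets(rewards, q_next_target, dones, gamma=0.99):
    """Vanilla DQN: the target network both selects and evaluates the next action."""
    return rewards + gamma * (1.0 - dones) * q_next_target.max(axis=1)

def double_dqn_targets(rewards, q_next_online, q_next_target, dones, gamma=0.99):
    """Double DQN: the online network selects the action, the target network evaluates it."""
    best_actions = q_next_online.argmax(axis=1)
    evaluated = q_next_target[np.arange(len(best_actions)), best_actions]
    return rewards + gamma * (1.0 - dones) * evaluated
```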
4 changes: 2 additions & 2 deletions FA/README.md
### Summary

- Building a big table, one value for each state or state-action pair, is memory- and data-inefficient. Function Approximation can generalize to unseen states by using a featurized state representation.
- Treat RL as a supervised learning problem with the MC- or TD-target as the label and the current state/action as the input. Often the target also depends on the function estimator but we simply ignore its gradient. That's why these methods are called semi-gradient methods.
- Challenge: We have non-stationary (policy changes, bootstrapping) and non-iid (correlated in time) data.
- Many methods assume that our action space is discrete because they rely on calculating the argmax over all actions. Large and continuous action spaces are ongoing research.
- For Control very few convergence guarantees exist. For non-linear approximators there are basically no guarantees at all. But it works in practice.
- Experience Replay: Store experience as a dataset, randomize it, and repeatedly apply minibatch SGD.
- Tricks to stabilize non-linear function approximators: Fixed Targets. The target is calculated based on frozen parameter values from a previous time step.
- For the non-episodic (continuing) case function approximation is more complex and we need to give up discounting and use an "average reward" formulation.
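
A minimal semi-gradient TD(0) update with a linear function approximator, to make the "ignore the target's gradient" point concrete; the feature function `phi` is a stand-in for whatever featurization the exercises use.

```python
import numpy as np

def semi_gradient_td0_update(w, phi, s, r, s_next, done, alpha=0.01, gamma=0.99):
    """One step of semi-gradient TD(0) for v_hat(s, w) = w . phi(s)."""
    v_s = np.dot(w, phi(s))
    v_next = 0.0 if done else np.dot(w, phi(s_next))
    td_target = r + gamma * v_next          # treated as a fixed label, its gradient is ignored
    td_error = td_target - v_s
    return w + alpha * td_error * phi(s)    # gradient of v_hat(s, w) w.r.t. w is phi(s)
```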


2 changes: 1 addition & 1 deletion Introduction/README.md

### Summary

- Reinforcement Learning (RL) is concerned with goal-directed learning and decision-making.
- In RL an agent learns from experiences it gains by interacting with the environment. In Supervised Learning we cannot affect the environment.
- In RL rewards are often delayed in time and the agent tries to maximize a long-term goal. For example, one may need to make seemingly suboptimal moves to reach a winning position in a game.
- An agent interacts with the environment via states, actions and rewards.
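
As an illustration, a bare-bones interaction loop in the style of the (older) OpenAI Gym API used throughout this repository; the random action is just a placeholder for a learned policy.

```python
import gym

env = gym.make("CartPole-v0")
state = env.reset()
total_reward, done = 0.0, False
while not done:
    action = env.action_space.sample()          # placeholder for a learned policy pi(a|s)
    state, reward, done, _ = env.step(action)   # environment returns S_{t+1} and R_{t+1}
    total_reward += reward                      # the agent's goal: maximize cumulative reward
print("Episode return:", total_reward)
```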
10 changes: 5 additions & 5 deletions MDP/README.md

### Summary

- Agent & Environment Interface: At each step `t` the agent receives a state `S_t`, performs an action `A_t` and receives a reward `R_{t+1}`. The action is chosen according to a policy function `pi`.
- The total return `G_t` is the sum of all rewards starting from time `t`. Future rewards are discounted by `gamma^k`, where `gamma` is the discount rate (see the short sketch after this list).
- Markov property: The environment's response at time `t+1` depends only on the state and action representations at time `t`. The future is independent of the past given the present. Even if an environment doesn't fully satisfy the Markov property we still treat it as if it is and try to construct the state representation to be approximately Markov.
- Markov Decision Process (MDP): Defined by a state set `S`, an action set `A` and one-step dynamics `p(s',r | s,a)`. If we have complete knowledge of the environment we know the transition dynamics. In practice we often don't know the full MDP (but we know that it's some MDP).
- The Value Function `v(s)` estimates how "good" it is for an agent to be in a particular state. More formally, it's the expected return `G_t` given that the agent is in state `s`. `v(s) = Ex[G_t | S_t = s]`. Note that the value function is specific to a given policy `pi`.
- The Action-Value Function `q(s, a)` estimates how "good" it is for an agent to be in state `s` and take action `a`. Similar to the value function, but also considers the action.
- The Bellman equation expresses the relationship between the value of a state and the values of its successor states. It can be expressed using a "backup" diagram. Bellman equations exist for both the value function and the action value function.
- Value functions define an ordering over policies. A policy `p1` is better than `p2` if `v_p1(s) >= v_p2(s)` for all states s. For MDPs there exist one or more optimal policies that are better than or equal to all other policies.
- The optimal state value function `v*(s)` is the value function for the optimal policy. Same for `q*(s, a)`. The Bellman Optimality Equation defines how the optimal value of a state is related to the optimal value of successor states. It has a "max" instead of an average.
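
To make the return and its recursive structure (which the Bellman equations build on) concrete, a tiny example with made-up rewards:

```python
gamma = 0.9
rewards = [1.0, 0.0, 2.0, 1.0]   # illustrative R_{t+1}, R_{t+2}, ...

# Direct definition: G_t = sum_k gamma^k * R_{t+k+1}
G = sum(gamma**k * r for k, r in enumerate(rewards))

# Recursive form: G_t = R_{t+1} + gamma * G_{t+1}
G_recursive = 0.0
for r in reversed(rewards):
    G_recursive = r + gamma * G_recursive

assert abs(G - G_recursive) < 1e-12
```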


### Lectures & Readings

### Exercises

This chapter is mostly theory so there are no exercises.
2 changes: 1 addition & 1 deletion PolicyGradient/README.md
### Summary

- Idea: Instead of parameterizing the value function and doing greedy policy improvement we parameterize the policy and do gradient ascent in a direction that improves it.
- Sometimes the policy is easier to approximate than the value function. Also, we need a parameterized policy to deal with continuous action spaces and environments where we need to act stochastically.
- Policy Score Function `J(theta)`: Intuitively, it measures how good our policy is. For example, we can use the average value or average reward under a policy as our objective.
- Common choices for the policy function: Softmax for discrete actions, Gaussian parameters for continuous actions.
- Policy Gradient Theorem: `grad(J(theta)) = Ex[grad(log(pi(s, a))) * Q(s, a)]`. Basically, we move our policy in a direction of more reward.
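
A small sketch of the score-function (REINFORCE-style) update implied by the Policy Gradient Theorem above, for a softmax policy over discrete actions; the feature vector and the `Q(s, a)` estimate are illustrative placeholders.

```python
import numpy as np

def softmax_policy(theta, phi_s):
    """pi(a|s) proportional to exp(theta[a] . phi(s)); theta has shape [num_actions, num_features]."""
    logits = theta @ phi_s
    logits -= logits.max()                      # numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()

def policy_gradient_step(theta, phi_s, action, q_estimate, alpha=0.01):
    """theta += alpha * grad(log pi(a|s)) * Q(s, a).

    For a softmax policy, grad_theta log pi(a|s) = outer(one_hot(a) - pi(.|s), phi(s)).
    """
    probs = softmax_policy(theta, phi_s)
    one_hot = np.zeros_like(probs)
    one_hot[action] = 1.0
    return theta + alpha * q_estimate * np.outer(one_hot - probs, phi_s)
```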
