
Chap.4. softmax(dim=1) #32

Open
yongduek opened this issue Jun 19, 2021 · 3 comments

@yongduek

The code for the model is as below:

model = torch.nn.Sequential(
    torch.nn.Linear(l1, l2),
    torch.nn.LeakyReLU(),
    torch.nn.Linear(l2, l3),
    torch.nn.Softmax(dim=0) #C
)

But the softmax operation with dim=0 is only OK when the input is a 1-dimensional array. When you give a batch input, the probabilities are normalized along the batch dimension (each column of the output sums to 1) instead of across the two actions of each state.
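
You can see the difference on a toy batch (a quick check of my own, not from the book):

    import torch
    x = torch.randn(4, 2)                         # a fake batch: 4 states, 2 actions
    print(torch.nn.Softmax(dim=0)(x).sum(dim=0))  # columns sum to 1: normalized across the batch
    print(torch.nn.Softmax(dim=1)(x).sum(dim=1))  # rows sum to 1: one distribution per state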

You can also check this by printing pred_batch in Listing 4.8:

    pred_batch = model(state_batch) #N
    print(pred_batch)

One way to fix this is by modifying it to:

    torch.nn.Softmax(dim=1) #C

and applying unsqueeze(0) and squeeze(0) when the model is evaluated on just one state vector:

state1 = env.reset()
pred = model(torch.from_numpy(state1).float().unsqueeze(0)) #G
action = np.random.choice(np.array([0,1]), p=pred.data.numpy().squeeze(0)) #H
state2, reward, done, info = env.step(action) #I

I like this book a lot since it gives some intuition for RL rather than only presenting the theory ^^

@grisuji

grisuji commented Dec 14, 2023

After fixing this issue as described above, it turned out that the learning rate should be lower for good learning.
learning_rate = 0.001
works quite well for me.
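
For reference, a minimal sketch of that change (my own snippet, reusing model from the listing above):

    learning_rate = 0.001
    optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)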

@Mohammadfathi63

Mohammadfathi63 commented Jul 5, 2024

Hi friends, with help from Dr. Kiaei and his very good reinforcement learning course I was able to correct this code. My thanks to him. Enjoy!

1- Change dim=0 (column) to dim=1 (row) in the model (network).
Attention: the probabilities must sum to 1. With dim=0, each column of the batch output sums to 1; with dim=1, each row sums to 1. The row version is the correct one, because each state fed to the model should produce its own row of action probabilities.

2- Add squeeze and unsqueeze in a few lines.

3- Edit the discount_rewards function so that it produces G_1, G_2, ...

4- In batch mode the model is run a second time. This extra forward pass can be removed, but then you must take care that the weight update is still done correctly; in this code I keep the second run in batch mode (an alternative is sketched right after this list).
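
Regarding point 4, a sketch of that alternative (my own, not from the book; it assumes env, model, optimizer, loss_fn, discount_rewards2, MAX_EPISODES and MAX_DUR as defined in the full code below): keep the probability of each chosen action during the rollout, with gradients attached, and stack them after the episode instead of running the model a second time.

    for episode in range(MAX_EPISODES):
        curr_state = env.reset()[0]
        prob_chosen = []                          # reset every episode
        rewards = []
        for t in range(MAX_DUR):
            act_prob = model(torch.from_numpy(curr_state).float().unsqueeze(0))
            action = np.random.choice(np.array([0,1]), p=act_prob.data.numpy().squeeze(0))
            prob_chosen.append(act_prob[0, action])   # no .data here, so gradients are kept
            curr_state, _, done, _, _ = env.step(action)
            rewards.append(t+1)
            if done:
                break
        reward_batch = torch.tensor(rewards).flip(dims=(0,))
        disc_returns = discount_rewards2(reward_batch, 0.99)
        prob_batch = torch.stack(prob_chosen)         # replaces the second forward pass + gather
        loss = loss_fn(prob_batch, disc_returns)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()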


!pip install gymnasium

import numpy as np
import torch
import gymnasium as gym
from matplotlib import pyplot as plt

env = gym.make("CartPole-v1")

l1 = 4 #A
l2 = 150
l3 = 2 #B

model = torch.nn.Sequential(
    torch.nn.Linear(l1, l2),
    torch.nn.LeakyReLU(),
    torch.nn.Linear(l2, l3),
    torch.nn.Softmax(dim=1) #C
)

learning_rate = 0.001
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)

########################################################
state1 = env.reset()[0]
pred = model(torch.from_numpy(state1).float().unsqueeze(0)) #G
action = np.random.choice(np.array([0,1]), p=pred.data.numpy().squeeze(0)) #H
state2, reward, done, _, info = env.step(action) #I

##########################################################
def discount_rewards2(reward_batch, gamma=0.99, normalize = True):
    # Gt = Rt + g * Rt+1 + g^2 *Rt+2
    # returns = [G_1, G_2, G_3, ... , G_T]
    #example
    # R =[R3,R2,R1]=[3,2,1], g=1
    #G_1 = R3 + g * R2 + g^2 *R1= 6
    #G_2 = R2 + g *R1= 3
    #G_3 = R1 = 1
    batch_Gvals =[]
    for i in range(len(reward_batch)):
        new_Gval=0
        power=0
        for j in range(i,len(reward_batch)):
             new_Gval=new_Gval+((gamma**power)*reward_batch[j]).numpy()
             power+=1
        batch_Gvals.append(new_Gval)
    returns=torch.FloatTensor(batch_Gvals)

    if normalize:
        returns = (returns - returns.mean()) / returns.std()
        #returns /= returns.max()
    return returns
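# Quick sanity check (my own example, not from the book): with the rewards from the
# docstring, R = [3, 2, 1] and gamma = 1, the un-normalized returns come out right:
#   discount_rewards2(torch.FloatTensor([3., 2., 1.]), gamma=1.0, normalize=False)
#   -> tensor([6., 3., 1.])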
##########################################################
def loss_fn(preds, r): #A
    return -1 * torch.sum(r * torch.log(preds)) #B
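# Note (my addition, not from the book): if a predicted probability ever becomes
# exactly 0, torch.log(preds) returns -inf; clamping first, e.g.
# torch.log(preds.clamp(min=1e-6)), is a common safeguard.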
##########################################################
MAX_DUR = 200
MAX_EPISODES = 500
gamma = 0.99
score = [] #A
expectation = 0.0
for episode in range(MAX_EPISODES):
    curr_state = env.reset()[0]
    done = False
    transitions = [] #B

    for t in range(MAX_DUR): #C
        act_prob = model(torch.from_numpy(curr_state).float().unsqueeze(0)) #D
        action = np.random.choice(np.array([0,1]), p=act_prob.data.numpy().squeeze(0)) #E
        prev_state = curr_state
        curr_state, _, done, _, info = env.step(action) #F
        transitions.append((prev_state, action, t+1)) #G
        if done: #H
            break

    ep_len = len(transitions) #I
    score.append(ep_len)
    reward_batch = torch.tensor([r for (s,a,r) in transitions]).flip(dims=(0,)) #J
    disc_returns = discount_rewards2(reward_batch,0.99) #K
    state_batch = torch.tensor([s for (s,a,r) in transitions]) #L
    action_batch = torch.tensor([a for (s,a,r) in transitions]) #M
    pred_batch = model(state_batch) #N
    print(pred_batch)
    prob_batch = pred_batch.gather(dim=1,index=action_batch.long().view(-1,1)).squeeze() #O
    loss = loss_fn(prob_batch, disc_returns)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
##################################################################
def running_mean(x, N=50):
    kernel = np.ones(N)
    conv_len = x.shape[0]-N
    y = np.zeros(conv_len)
    for i in range(conv_len):
        y[i] = kernel @ x[i:i+N]
        y[i] /= N
    return y
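# Note (my addition): np.convolve(x, np.ones(N)/N, mode='valid') computes essentially
# the same running mean (its output is just one element longer than this version's).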

score = np.array(score)
avg_score = running_mean(score, 50)
plt.figure(figsize=(10,7))
plt.ylabel("Episode Duration",fontsize=22)
plt.xlabel("Training Epochs",fontsize=22)
plt.plot(avg_score, color='green')
#############################################################
score = []
games = 100
done = False
state1 = env.reset()[0]
for i in range(games):
    t=0
    while not done: #F
        pred = model(torch.from_numpy(state1).float().unsqueeze(0)) #G
        action = np.random.choice(np.array([0,1]), p=pred.data.numpy().squeeze(0)) #H
        state2, reward, done, _, info = env.step(action) #I
        state1 = state2

        t += 1
        if t > MAX_DUR: #L
            break
    state1 = env.reset()[0]
    done = False
    score.append(t)
score = np.array(score)

plt.scatter(np.arange(score.shape[0]),score)

@Mohammadfathi63

Download all of the corrected code from this link:
Ch4_Book_CorrectCode_Ver1.pdf
