
Chap.4. softmax(dim=1) #32

Open
yongduek opened this issue Jun 19, 2021 · 3 comments

@yongduek

The code for the model is as below:

model = torch.nn.Sequential(
    torch.nn.Linear(l1, l2),
    torch.nn.LeakyReLU(),
    torch.nn.Linear(l2, l3),
    torch.nn.Softmax(dim=0) #C
)

But the softmax operation with dim=0 is only OK when the input is a 1-dimensional array. When you give a batch input, the probabilities are normalized along the batch dimension (each column of the output sums to 1) instead of across the two actions of each state.
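
You can see the difference on a toy batch (a quick check of my own, not from the book):

    import torch
    x = torch.randn(4, 2)                         # a fake batch: 4 states, 2 actions
    print(torch.nn.Softmax(dim=0)(x).sum(dim=0))  # columns sum to 1: normalized across the batch
    print(torch.nn.Softmax(dim=1)(x).sum(dim=1))  # rows sum to 1: one distribution per state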

You can also check this by printing pred_batch in Listing 4.8:

    pred_batch = model(state_batch) #N
    print(pred_batch)

One way to fix this is by modifying it to:

    torch.nn.Softmax(dim=1) #C

and applying unsqueeze(0) and squeeze(0) when the model is evaluated on just one state vector:

state1 = env.reset()
pred = model(torch.from_numpy(state1).float().unsqueeze(0)) #G
action = np.random.choice(np.array([0,1]), p=pred.data.numpy().squeeze(0)) #H
state2, reward, done, info = env.step(action) #I

I like this book a lot since it gives some intuition for RL rather than only presenting the theory ^^

@grisuji

grisuji commented Dec 14, 2023

After fixing this issue as described above, it turned out that the learning rate should be lower for good learning.
learning_rate = 0.001
works quite well for me.
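
For reference, a minimal sketch of that change (my own snippet, reusing model from the listing above):

    learning_rate = 0.001
    optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)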

@Mohammadfathi63

Mohammadfathi63 commented Jul 5, 2024

Hi friends, with help from Dr. Kiaei and his very good reinforcement learning course I was able to correct this code. My thanks to him. Enjoy!

1- Change dim=0 (column) to dim=1 (row) in the model (network).
Attention: the probabilities must sum to 1. With dim=0, each column of the batch output sums to 1; with dim=1, each row sums to 1. The row version is the correct one, because each state fed to the model should produce its own row of action probabilities.

2- Add squeeze and unsqueeze in a few lines.

3- Edit the discount_rewards function so that it produces G_1, G_2, ...

4- In batch mode the model is run a second time. This extra forward pass can be removed, but then you must take care that the weight update is still done correctly; in this code I keep the second run in batch mode (an alternative is sketched right after this list).
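
Regarding point 4, a sketch of that alternative (my own, not from the book; it assumes env, model, optimizer, loss_fn, discount_rewards2, MAX_EPISODES and MAX_DUR as defined in the full code below): keep the probability of each chosen action during the rollout, with gradients attached, and stack them after the episode instead of running the model a second time.

    for episode in range(MAX_EPISODES):
        curr_state = env.reset()[0]
        prob_chosen = []                          # reset every episode
        rewards = []
        for t in range(MAX_DUR):
            act_prob = model(torch.from_numpy(curr_state).float().unsqueeze(0))
            action = np.random.choice(np.array([0,1]), p=act_prob.data.numpy().squeeze(0))
            prob_chosen.append(act_prob[0, action])   # no .data here, so gradients are kept
            curr_state, _, done, _, _ = env.step(action)
            rewards.append(t+1)
            if done:
                break
        reward_batch = torch.tensor(rewards).flip(dims=(0,))
        disc_returns = discount_rewards2(reward_batch, 0.99)
        prob_batch = torch.stack(prob_chosen)         # replaces the second forward pass + gather
        loss = loss_fn(prob_batch, disc_returns)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()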


!pip install gymnasium

import numpy as np
import torch
import gymnasium as gym
from matplotlib import pyplot as plt

env = gym.make("CartPole-v1")

l1 = 4 #A
l2 = 150
l3 = 2 #B

model = torch.nn.Sequential(
    torch.nn.Linear(l1, l2),
    torch.nn.LeakyReLU(),
    torch.nn.Linear(l2, l3),
    torch.nn.Softmax(dim=1) #C
)

learning_rate = 0.001
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)

########################################################
state1 = env.reset()[0]
pred = model(torch.from_numpy(state1).float().unsqueeze(0)) #G
action = np.random.choice(np.array([0,1]), p=pred.data.numpy().squeeze(0)) #H
state2, reward, done, _, info = env.step(action) #I

##########################################################
def discount_rewards2(reward_batch, gamma=0.99, normalize = True):
    # Gt = Rt + g * Rt+1 + g^2 *Rt+2
    # returns = [G_1, G_2, G_3, ... , G_T]
    #example
    # R =[R3,R2,R1]=[3,2,1], g=1
    #G_1 = R3 + g * R2 + g^2 *R1= 6
    #G_2 = R2 + g *R1= 3
    #G_3 = R1 = 1
    batch_Gvals =[]
    for i in range(len(reward_batch)):
        new_Gval=0
        power=0
        for j in range(i,len(reward_batch)):
             new_Gval=new_Gval+((gamma**power)*reward_batch[j]).numpy()
             power+=1
        batch_Gvals.append(new_Gval)
    returns=torch.FloatTensor(batch_Gvals)

    if normalize:
        returns = (returns - returns.mean()) / returns.std()
        #returns /= returns.max()
    return returns
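# Quick sanity check (my own example, not from the book): with the rewards from the
# docstring, R = [3, 2, 1] and gamma = 1, the un-normalized returns come out right:
#   discount_rewards2(torch.FloatTensor([3., 2., 1.]), gamma=1.0, normalize=False)
#   -> tensor([6., 3., 1.])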
##########################################################
def loss_fn(preds, r): #A
    return -1 * torch.sum(r * torch.log(preds)) #B
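# Note (my addition, not from the book): if a predicted probability ever becomes
# exactly 0, torch.log(preds) returns -inf; clamping first, e.g.
# torch.log(preds.clamp(min=1e-6)), is a common safeguard.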
##########################################################
MAX_DUR = 200
MAX_EPISODES = 500
gamma = 0.99
score = [] #A
expectation = 0.0
for episode in range(MAX_EPISODES):
    curr_state = env.reset()[0]
    done = False
    transitions = [] #B

    for t in range(MAX_DUR): #C
        act_prob = model(torch.from_numpy(curr_state).float().unsqueeze(0)) #D
        action = np.random.choice(np.array([0,1]), p=act_prob.data.numpy().squeeze(0)) #E
        prev_state = curr_state
        curr_state, _, done, _, info = env.step(action) #F
        transitions.append((prev_state, action, t+1)) #G
        if done: #H
            break

    ep_len = len(transitions) #I
    score.append(ep_len)
    reward_batch = torch.tensor([r for (s,a,r) in transitions]).flip(dims=(0,)) #J
    disc_returns = discount_rewards2(reward_batch,0.99) #K
    state_batch = torch.tensor([s for (s,a,r) in transitions]) #L
    action_batch = torch.tensor([a for (s,a,r) in transitions]) #M
    pred_batch = model(state_batch) #N
    print(pred_batch)
    prob_batch = pred_batch.gather(dim=1,index=action_batch.long().view(-1,1)).squeeze() #O
    loss = loss_fn(prob_batch, disc_returns)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
##################################################################
def running_mean(x, N=50):
    kernel = np.ones(N)
    conv_len = x.shape[0]-N
    y = np.zeros(conv_len)
    for i in range(conv_len):
        y[i] = kernel @ x[i:i+N]
        y[i] /= N
    return y
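# Note (my addition): np.convolve(x, np.ones(N)/N, mode='valid') computes essentially
# the same running mean (its output is just one element longer than this version's).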

score = np.array(score)
avg_score = running_mean(score, 50)
plt.figure(figsize=(10,7))
plt.ylabel("Episode Duration",fontsize=22)
plt.xlabel("Training Epochs",fontsize=22)
plt.plot(avg_score, color='green')
#############################################################
score = []
games = 100
done = False
state1 = env.reset()[0]
for i in range(games):
    t=0
    while not done: #F
        pred = model(torch.from_numpy(state1).float().unsqueeze(0)) #G
        action = np.random.choice(np.array([0,1]), p=pred.data.numpy().squeeze(0)) #H
        state2, reward, done, _, info = env.step(action) #I
        state1 = state2

        t += 1
        if t > MAX_DUR: #L
            break
    state1 = env.reset()[0]
    done = False
    score.append(t)
score = np.array(score)

plt.scatter(np.arange(score.shape[0]),score)

@Mohammadfathi63

Download all of the corrected code from this link:
Ch4_Book_CorrectCode_Ver1.pdf
