RuntimeError: output with shape [256] doesn't match the broadcast shape [256, 256] #234

Open
ajay-vikram opened this issue Jul 30, 2024 · 15 comments

Comments

@ajay-vikram

I have trained a recurrent network using an LSTMCell and MLP layers. But when I load the model and the weights to run the benchmark, I get "RuntimeError: output with shape [256] doesn't match the broadcast shape [256, 256]". Tracing it backwards, it originates in utils.py on line 291 (out += biases). Printing the shapes of out and biases gives [256] and [256, 1] respectively. Squeezing out the second dimension of biases resolves the issue, but I am unsure whether the mistake is in the benchmark code or in how my model is defined. I hit a similar issue when using a GRUCell. Can I please get some help?
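
For readers unfamiliar with this error: an in-place add such as out += biases cannot resize out, so the broadcast of the two shapes has to equal the shape out already has. The sketch below re-implements the usual NumPy/PyTorch broadcasting rule in plain Python to show why [256] and [256, 1] fail. The helper names broadcast_shape and inplace_add_ok are made up for illustration and are not part of PyTorch or neurobench.

```python
from itertools import zip_longest


def broadcast_shape(a, b):
    """Return the broadcast shape of two shapes, or raise ValueError."""
    result = []
    # Align the shapes from the trailing dimension; a missing dimension counts as 1.
    for x, y in zip_longest(reversed(a), reversed(b), fillvalue=1):
        if x != y and x != 1 and y != 1:
            raise ValueError(f"shapes {a} and {b} are not broadcastable")
        # A size-1 dimension stretches to match the other operand.
        result.append(y if x == 1 else x)
    return tuple(reversed(result))


def inplace_add_ok(out_shape, other_shape):
    """An in-place add requires the broadcast result to equal the output's own shape."""
    return broadcast_shape(out_shape, other_shape) == tuple(out_shape)


print(broadcast_shape((256,), (256, 1)))   # (256, 256)
print(inplace_add_ok((256,), (256, 1)))    # False: the case reported here
print(inplace_add_ok((256, 1), (256, 1)))  # True
print(inplace_add_ok((256,), (256,)))      # True: why squeezing biases hides the error
```

Squeezing biases makes the shapes agree, but it does not explain why out lost its second dimension in the first place, which is the question the rest of the thread pursues.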

@jasonlyik
Contributor

Hi Ajay, you may be running with the data shaped differently. We expect that the out tensor is shaped [4*hidden_state, batch_size], so I would expect that out should be shaped [256, 1] and not [256].
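
As a rough illustration of where a [256, 1] shape can come from: an LSTMCell stacks its four gates, so its input weight matrix has 4 * hidden_size rows. Multiplying that matrix by a batched input keeps a trailing batch dimension, while an unbatched input drops it. This is plain PyTorch, not the neurobench hook code, and the sizes 96 and 64 are borrowed from the model posted later in this thread.

```python
import torch
import torch.nn as nn

input_size, hidden_size = 96, 64
cell = nn.LSTMCell(input_size, hidden_size)

# The four gates are stacked, so the weights have 4 * hidden_size = 256 rows.
weight_shape = tuple(cell.weight_ih.shape)
print(weight_shape)  # (256, 96)

batched = torch.zeros(1, input_size)  # [batch_size, input_size]
unbatched = torch.zeros(input_size)   # [input_size], no batch dimension

with torch.no_grad():
    pre_batched = cell.weight_ih @ batched.T    # [4 * hidden_size, batch_size]
    pre_unbatched = cell.weight_ih @ unbatched  # [4 * hidden_size]

print(tuple(pre_batched.shape))    # (256, 1)
print(tuple(pre_unbatched.shape))  # (256,)
```

Note that the 256 in the error message matches 4 * 64 here; it only coincidentally equals the DataLoader batch size of 256 used in the benchmark script.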

At benchmark.py:125 (batch_results[m] = self.workload_metrics[m](self.model, preds, data)), can you please check the shape of preds and data? Otherwise, it may be an issue with the hook connected to the RNNCell which tracks inputs.

Also, there is the LSTM example for a different sequence task here which may be helpful.

@ajay-vikram
Author

Hi Jason,
The shapes of preds and data are [256, 2] and ([256, 1, 96], [256, 2]) respectively, where data is a tuple. These are the inputs to my model as well. What shape do you expect as input to the LSTMCell? In my case, a [1, 96] tensor goes to the LSTMCell. This [1, 96] comes from the acc_spikes in the buffering mechanism of the forward pass, similar to the one in primate_example.

@jasonlyik
Contributor

The shape looks reasonable to me. Can you check whether your code matches the code block from the previous issue #225? That works with the latest neurobench package, 1.0.6, and with any batch size. If there are still issues, please post your code block so we can inspect the error.

@ajay-vikram
Author

Ohh, I see. I didn't get the latest version. How do I get it? Do I run .bumpversion.toml?

@jasonlyik
Contributor

jasonlyik commented Jul 31, 2024

pip install --upgrade neurobench

Or, if you are using poetry with a locally cloned repo, simply git pull on the main branch.
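
To confirm which version actually ended up in the active environment after upgrading, something like the following can help. It uses only the standard library; the helper names installed_version and parse_release are hypothetical, and the version parsing is deliberately crude (it stops at the first non-numeric component).

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string, or None if the package is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


def parse_release(text):
    """Turn '1.0.6' into (1, 0, 6); stops at the first non-numeric part."""
    parts = []
    for part in text.split("."):
        if not part.isdigit():
            break
        parts.append(int(part))
    return tuple(parts)


found = installed_version("neurobench")
if found is None:
    print("neurobench is not installed in this environment")
else:
    # 1.0.6 is the version mentioned earlier in this thread.
    print("neurobench", found, "is at least 1.0.6:", parse_release(found) >= (1, 0, 6))
```

Running this with the same interpreter that runs the benchmark script matters, since pip may have upgraded a different environment.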

@ajay-vikram
Author

Still getting the same issue. Can you tell me which code has been modified? I'll check whether the changes have been picked up.

@jasonlyik
Contributor

Changes are listed in #227

Please check if you can successfully run the minimal example from the code block in #225

If there is still an issue, please provide a minimal example of the model definition and harness call which causes the issue.

@ajay-vikram
Author

ajay-vikram commented Jul 31, 2024

Yes, the minimal example code runs.

Here's my model definition:

import torch
import torch.nn as nn

class LSTM(nn.Module):
    def __init__(self, input_dim):
        super(LSTM, self).__init__()
        self.input_dim = input_dim
        self.output_dim = 2

        self.lstm = nn.LSTMCell(self.input_dim, 64)
        self.fc1 =  nn.Linear(64, 32)
        self.fc2 = nn.Linear(32, 16)
        self.fc3 = nn.Linear(16, self.output_dim)
        self.layernorm0 = nn.LayerNorm(self.input_dim)
        self.layernorm1 = nn.LayerNorm(32)
        self.layernorm2 = nn.LayerNorm(16)
        self.relu = nn.ReLU()
        self.dropout = nn.Dropout(0.3)

        self.bin_window_time = 0.2
        self.sampling_rate = 0.004
        self.bin_window_size = int(self.bin_window_time / self.sampling_rate)
        self.register_buffer("data_buffer", torch.zeros(1, self.input_dim).type(torch.float32), persistent=False)
    
    def single_forward(self,x):
        x = x.unsqueeze(0)
        x = self.layernorm0(x)
        (hn, cn) = self.lstm(x)
        out = self.relu(hn)
        out = self.layernorm1(self.relu(self.fc1(out)))
        out = self.dropout(out)
        out = self.layernorm2(self.relu(self.fc2(out)))
        out = self.fc3(out)
        return out

    def forward(self, x):
        predictions = []

        seq_length = x.shape[0]
        for seq in range(seq_length):
            current_seq = x[seq, :, :]
            self.data_buffer = torch.cat((self.data_buffer, current_seq), dim=0)
            if self.data_buffer.shape[0] <= self.bin_window_size:
                predictions.append(torch.zeros(1, self.output_dim).to(x.device))
            else:
                # Only pass input into model when the buffer size == bin_window_size
                if self.data_buffer.shape[0] > self.bin_window_size:
                    self.data_buffer = self.data_buffer[1:, :]

                # Accumulate
                spikes = self.data_buffer.clone()
                acc_spikes = torch.sum(spikes, dim=0)
                pred = self.single_forward(acc_spikes)
                predictions.append(pred)

        predictions = torch.stack(predictions).squeeze(dim=1)
 
        return predictions

@ajay-vikram
Author

This is the benchmark code:

import torch
from torch.utils.data import DataLoader, Subset

from neurobench.datasets import PrimateReaching
from neurobench.models.torch_model import TorchModel
from neurobench.benchmarks import Benchmark

from ANN import ANNModel2D
from GRU import GRU
from LSTM import LSTM

all_files = ["indy_20160622_01"]
# all_files = ["indy_20160622_01", "indy_20160630_01", "indy_20170131_02", 
#              "loco_20170210_03", "loco_20170215_02", "loco_20170301_05"]

footprint = []
connection_sparsity = []
activation_sparsity = []
dense = []
macs = []
acs = []
r2 = []

device = torch.device("cuda:1" if torch.cuda.is_available() else "cpu")

for filename in all_files:
    print("Processing {}".format(filename))

    # The dataloader and preprocessor have been combined into a single class
    data_dir = "/home/satyapreets/Ajay/neurobench/neurobench/data" # data in repo root dir
    dataset = PrimateReaching(file_path=data_dir, filename=filename,
                            num_steps=1, train_ratio=0.5, bin_width=0.004,
                            biological_delay=0, remove_segments_inactive=False)

    test_set_loader = DataLoader(Subset(dataset, dataset.ind_test), batch_size=256, shuffle=False)

    net = LSTM(input_dim=dataset.input_feature_size)
    # net = ANNModel2D(input_dim=dataset.input_feature_size, layer1=32, layer2=48, 
    #                  output_dim=2, bin_window=0.2, drop_rate=0.5)

    net.load_state_dict(torch.load("/home/satyapreets/Ajay/neurobench/mobilenet_training/experiments/vww/submission/lstm_64_indy_20160622_01.pt", map_location=device)['state_dict'])
    # net.load_state_dict(torch.load("./model_data/2D_ANN_Weight/"+filename+"_model_state_dict.pth", map_location=device))

    model = TorchModel(net)

    static_metrics = ["footprint", "connection_sparsity"]
    workload_metrics = ["r2", "activation_sparsity", "synaptic_operations"]

    # Benchmark expects the following:
    benchmark = Benchmark(model, test_set_loader, [], [], [static_metrics, workload_metrics])
    results = benchmark.run(device=device)
    print(results)

    footprint.append(results['footprint'])
    connection_sparsity.append(results['connection_sparsity'])
    activation_sparsity.append(results['activation_sparsity'])
    dense.append(results['synaptic_operations']['Dense'])
    macs.append(results['synaptic_operations']['Effective_MACs'])
    acs.append(results['synaptic_operations']['Effective_ACs'])
    r2.append(results['r2'])

print("Footprint: {}".format(footprint))
print("Connection sparsity: {}".format(connection_sparsity))
print("Activation sparsity: {}".format(activation_sparsity), sum(activation_sparsity)/len(activation_sparsity))
print("Dense: {}".format(dense), sum(dense)/len(dense))
print("MACs: {}".format(macs), sum(macs)/len(macs))
print("ACs: {}".format(acs), sum(acs)/len(acs))
print("R2: {}".format(r2), sum(r2)/len(r2))

# Footprint: [20824, 20824, 20824, 33496, 33496, 33496]
# Connection sparsity: [0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
# Activation sparsity: [0.7068512007122443, 0.7274494314849341, 0.6142621034584272, 0.6290474755671983, 0.6793054885963405, 0.6963649652600741] 0.6755467775132032
# Dense: [4702.261627687736, 4701.8430499148435, 4699.549582947173, 7773.2197567257945, 7771.01773105288, 7772.632844051291] 6236.754098729952
# MACs: [4306.322415210456, 3595.209672287623, 3607.261044176707, 5851.9819915795315, 5995.014802029395, 6462.786839756449] 4969.76279417336
# ACs: [0.0, 0.0, 0.0, 0.0, 0.0, 0.0] 0.0
# R2: [0.6327020525932312, 0.5241347551345825, 0.6216747164726257, 0.5727078914642334, 0.4745999276638031, 0.6272222995758057] 0.5755069404840469

@jasonlyik
Contributor

Hi Ajay, I noticed that your LSTMCell forward call does not include the (h, c) in the inputs. Based on the documentation, if these are not included, I believe that the recurrent state of the LSTM is not tracked at all, and essentially the LSTM block is just an MLP-type transform. I may be wrong on this, though.

Regardless, note that all of our other LSTM examples use the forward convention for the LSTMCell hx, cx = rnn(input[i], (hx, cx)), and not just hx, cx = rnn(input[i]).
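
The difference between the two conventions can be checked directly with a small experiment. This is a minimal sketch in plain PyTorch with a randomly initialised cell (sizes borrowed from the model above), not a statement about how the harness behaves: without a state argument each call starts again from zeros, so repeated calls on the same input give identical results, whereas carrying (h, c) forward changes the output.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
cell = nn.LSTMCell(input_size=96, hidden_size=64)
x = torch.randn(1, 96)
zeros = torch.zeros(1, 64)

with torch.no_grad():
    # No state passed: both calls start from a zero state, so they agree.
    h_a, c_a = cell(x)
    h_b, c_b = cell(x)
    same_without_state = torch.allclose(h_a, h_b)

    # Omitting the state is equivalent to passing zeros explicitly.
    h_z, c_z = cell(x, (zeros, zeros))
    matches_zero_state = torch.allclose(h_a, h_z)

    # State carried forward: the second step depends on the first.
    h1, c1 = cell(x, (zeros, zeros))
    h2, c2 = cell(x, (h1, c1))
    changes_with_state = not torch.allclose(h1, h2)

print(same_without_state, matches_zero_state, changes_with_state)
```

All three flags come out True, which is consistent with the reading that a cell called without (h, c) acts as a fixed, memoryless transform of its input.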

With the additions to your model definition shown in the code block below, the harness no longer raises a runtime error:

import torch
import torch.nn as nn

class LSTM(nn.Module):
    def __init__(self, input_dim):
        super(LSTM, self).__init__()
        self.input_dim = input_dim
        self.output_dim = 2

        self.lstm = nn.LSTMCell(self.input_dim, 64)
        self.fc1 =  nn.Linear(64, 32)
        self.fc2 = nn.Linear(32, 16)
        self.fc3 = nn.Linear(16, self.output_dim)
        self.layernorm0 = nn.LayerNorm(self.input_dim)
        self.layernorm1 = nn.LayerNorm(32)
        self.layernorm2 = nn.LayerNorm(16)
        self.relu = nn.ReLU()
        self.dropout = nn.Dropout(0.3)

        self.bin_window_time = 0.2
        self.sampling_rate = 0.004
        self.bin_window_size = int(self.bin_window_time / self.sampling_rate)
        self.register_buffer("data_buffer", torch.zeros(1, self.input_dim).type(torch.float32), persistent=False)

        self.h = None
        self.c = None
    
    def single_forward(self,x):
        x = x.unsqueeze(0)
        x = self.layernorm0(x)
        self.h, self.c = self.lstm(x, (self.h, self.c))
        out = self.relu(self.h)
        out = self.layernorm1(self.relu(self.fc1(out)))
        out = self.dropout(out)
        out = self.layernorm2(self.relu(self.fc2(out)))
        out = self.fc3(out)
        return out

    def forward(self, x):
        predictions = []

        self.h = torch.zeros(1, 64).to(x.device)
        self.c = torch.zeros(1, 64).to(x.device)

        seq_length = x.shape[0]
        for seq in range(seq_length):
            current_seq = x[seq, :, :]
            self.data_buffer = torch.cat((self.data_buffer, current_seq), dim=0)
            if self.data_buffer.shape[0] <= self.bin_window_size:
                predictions.append(torch.zeros(1, self.output_dim).to(x.device))
            else:
                # Only pass input into model when the buffer size == bin_window_size
                if self.data_buffer.shape[0] > self.bin_window_size:
                    self.data_buffer = self.data_buffer[1:, :]

                # Accumulate
                spikes = self.data_buffer.clone()
                acc_spikes = torch.sum(spikes, dim=0)
                pred = self.single_forward(acc_spikes)
                predictions.append(pred)

        predictions = torch.stack(predictions).squeeze(dim=1)
 
        return predictions

The harness should be able to support the case where (h, c) is not passed into the LSTMCell, so this is still an issue. But I recommend that you include (h, c) in the inputs.

@ajay-vikram
Author

Aah, I see. I read somewhere in the documentation that LSTMs by default initialize their hidden and cell states to a tensor of 0s, that's why I didn't explicitly add it. Thanks a lot!!

@ajay-vikram
Author

ajay-vikram commented Jul 31, 2024

Also, will I have to retrain my models with these changes incorporated? I only changed the model and loaded the same weights I had before adding the explicit h and c, and the neurobench benchmarks run fine.
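
One likely reason the old checkpoint still loads is that h and c are plain attributes rather than parameters or buffers, so they never appear in the state_dict. The sketch below uses two stripped-down stand-in classes (Stateless and Stateful are hypothetical names, not the models from this thread) to show that the keys are identical. It says nothing about whether the loaded weights still give good predictions once state is carried between steps.

```python
import torch
import torch.nn as nn


class Stateless(nn.Module):
    def __init__(self):
        super().__init__()
        self.lstm = nn.LSTMCell(96, 64)


class Stateful(nn.Module):
    def __init__(self):
        super().__init__()
        self.lstm = nn.LSTMCell(96, 64)
        # Plain attributes: not registered as parameters or buffers.
        self.h = None
        self.c = None


old, new = Stateless(), Stateful()
same_keys = list(old.state_dict().keys()) == list(new.state_dict().keys())

# load_state_dict reports any keys that are missing or unexpected.
result = new.load_state_dict(old.state_dict())
print(same_keys, result.missing_keys, result.unexpected_keys)
```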

@jasonlyik
Contributor

My guess is that you will need to retrain the model, as it is now tracking recurrent state and it wasn't before. I suggest that you take out all of the metrics except the R2 workload metric and first verify you are getting the expected accuracy before considering the compute complexity.
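
For reference, R2 (the coefficient of determination) compares the squared prediction error with the variance of the targets. The function below is a textbook single-output version in plain Python, given only to make the quantity concrete; neurobench's own r2 metric may aggregate differently, for example across the two output dimensions.

```python
def r2_score(targets, predictions):
    """Textbook R2 for one output dimension: 1 - SS_res / SS_tot."""
    mean = sum(targets) / len(targets)
    ss_res = sum((t - p) ** 2 for t, p in zip(targets, predictions))
    ss_tot = sum((t - mean) ** 2 for t in targets)
    return 1.0 - ss_res / ss_tot


print(r2_score([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]))  # 1.0: perfect prediction
print(r2_score([1.0, 2.0, 3.0], [2.0, 2.0, 2.0]))  # 0.0: no better than the mean
```

In the benchmark script posted above, restricting the run to this metric would mean setting static_metrics = [] and workload_metrics = ["r2"], and removing the lines that read the other entries from results.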

@ajay-vikram
Author

Alright thanks a lot!

@jasonlyik
Contributor

TODO: support synops for RNNCells which do not use recurrent input
