RuntimeError: probability tensor contains either inf, nan or element < 0
#3337
Replies: 2 comments
-
Sounds to me like an 8-bit quantization problem. If you have sufficient GPU memory, please try loading in 16-bit. Could you also check whether the error occurs without any padding (for single-element batches you don't need to pad the input)? If you do pad, you need to use left padding!
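To illustrate the left-padding point above: for decoder-only models, generation continues from the last position of each row, so the pads must go on the left to keep the real tokens at the end. A minimal plain-Python sketch (no model involved; the pad id of 0 and the function name are my own, for illustration only):

```python
PAD_ID = 0  # hypothetical pad token id, for illustration only

def left_pad(batch, pad_id=PAD_ID):
    """Pad each sequence on the LEFT so all rows share the longest length."""
    width = max(len(seq) for seq in batch)
    padded, masks = [], []
    for seq in batch:
        n_pad = width - len(seq)
        padded.append([pad_id] * n_pad + seq)       # pads go BEFORE the tokens
        masks.append([0] * n_pad + [1] * len(seq))  # 0 = ignore, 1 = attend
    return padded, masks

ids, mask = left_pad([[5, 6, 7], [8, 9]])
# ids  -> [[5, 6, 7], [0, 8, 9]]
# mask -> [[1, 1, 1], [0, 1, 1]]
```

With right padding, the second row would end in a pad token, and the model would be asked to generate from a pad position, which is exactly what the advice above warns against.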
-
I assume 16-bit is the default quantization. If so, I'm loading in 8-bit because I otherwise run into CUDA out-of-memory errors under my current configuration. When I loaded this exact model on the exact same cloud GPU in 8-bit (splitting the model across my GPU and CPU) with text-generation-webui (a GUI for running LLMs like OpenAssistant), it worked fine, but I suppose it's different when building an API. If your other suggestions don't work out, I might have to switch GPUs after all.
-
Hey. So I'm trying to make an OpenAssistant API, in order to use OpenAssistant as a fallback for a chatbot I'm trying to make (I'm using IBM Watson for the chatbot for what it's worth). To do so, I'm trying to get the Pythia 12B model (OpenAssistant/oasst-sft-4-pythia-12b-epoch-3.5) up and running on a cloud GPU on Google Cloud. I'm using a NVIDIA L4 GPU, and the machine I'm using has 16 vCPUs and 64 GB memory.
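For context on why this hardware is tight for 16-bit (this arithmetic is mine, not from the thread): the NVIDIA L4 has 24 GB of VRAM, and a model with roughly 12B parameters needs about 2 bytes per parameter in fp16 for the weights alone, before activations, KV cache, and CUDA overhead.

```python
# Back-of-envelope memory estimate for the model weights alone.
# The 12B parameter count is approximate; activations, KV cache,
# and framework overhead add several more GB on top of this.
PARAMS = 12_000_000_000

def weight_gib(params, bytes_per_param):
    """Weight memory in GiB for a given precision."""
    return params * bytes_per_param / 2**30

fp16_gib = weight_gib(PARAMS, 2)  # ~22.4 GiB: barely fits a 24 GB L4
int8_gib = weight_gib(PARAMS, 1)  # ~11.2 GiB: leaves real headroom
```

This is why 8-bit loading (or CPU offload) is attractive here even though it can introduce its own numerical issues.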
Here's the current code I have for my API right now:
I also created a file to test the API; it can be seen below:
The logs I'm getting for the error can be found below:
I have tried to debug what's going on by printing the values of both my `input_ids` and `attention_mask` tensors, as shown in this snippet of my API code:
The output I get is:
Now, I don't think the mins and maxes of either tensor should be identical, nor should the values be strictly 0 or 1, so I'm led to believe something is going wrong when transferring my values to the GPU. If anyone can help me out, I would greatly appreciate it!
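On the error in the title itself: sampling raises this message when any entry of the probability tensor is NaN, infinite, or negative, which is often the result of numerical overflow under 8-bit or fp16 quantization rather than a bad `input_ids`/`attention_mask`. A pure-Python sketch of the same validity check (the function name is mine, for illustration only):

```python
import math

def check_probs(probs):
    """Return the reasons a probability vector would crash sampling."""
    problems = []
    if any(math.isnan(p) for p in probs):
        problems.append("nan")
    if any(math.isinf(p) for p in probs):
        problems.append("inf")
    if any(p < 0 for p in probs):
        problems.append("element < 0")
    return problems

check_probs([0.2, float("nan"), 0.8])  # -> ["nan"]
check_probs([0.5, 0.5])                # -> [] (safe to sample)
```

Printing an equivalent check on the model's output probabilities just before sampling would pinpoint whether quantization, rather than the GPU transfer, is producing the invalid values.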