jsonl broken, will only read as json #17

Open
carelesswhisp opened this issue Mar 30, 2024 · 3 comments

@carelesswhisp

Any time I try to load a JSONL dataset, I get this error:
03:29:19-716909 INFO Loading JSONL datasets...
Traceback (most recent call last):
File "/media/cher/brains/text-generation-webui/installer_files/env/lib/python3.11/site-packages/gradio/queueing.py", line 407, in call_prediction
output = await route_utils.call_process_api(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/media/cher/brains/text-generation-webui/installer_files/env/lib/python3.11/site-packages/gradio/route_utils.py", line 226, in call_process_api
output = await app.get_blocks().process_api(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/media/cher/brains/text-generation-webui/installer_files/env/lib/python3.11/site-packages/gradio/blocks.py", line 1550, in process_api
result = await self.call_function(
^^^^^^^^^^^^^^^^^^^^^^^^^
File "/media/cher/brains/text-generation-webui/installer_files/env/lib/python3.11/site-packages/gradio/blocks.py", line 1199, in call_function
prediction = await utils.async_iteration(iterator)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/media/cher/brains/text-generation-webui/installer_files/env/lib/python3.11/site-packages/gradio/utils.py", line 519, in async_iteration
return await iterator.__anext__()
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/media/cher/brains/text-generation-webui/installer_files/env/lib/python3.11/site-packages/gradio/utils.py", line 512, in __anext__
return await anyio.to_thread.run_sync(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/media/cher/brains/text-generation-webui/installer_files/env/lib/python3.11/site-packages/anyio/to_thread.py", line 56, in run_sync
return await get_async_backend().run_sync_in_worker_thread(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/media/cher/brains/text-generation-webui/installer_files/env/lib/python3.11/site-packages/anyio/_backends/_asyncio.py", line 2134, in run_sync_in_worker_thread
return await future
^^^^^^^^^^^^
File "/media/cher/brains/text-generation-webui/installer_files/env/lib/python3.11/site-packages/anyio/_backends/_asyncio.py", line 851, in run
result = context.run(func, *args)
^^^^^^^^^^^^^^^^^^^^^^^^
File "/media/cher/brains/text-generation-webui/installer_files/env/lib/python3.11/site-packages/gradio/utils.py", line 495, in run_sync_iterator_async
return next(iterator)
^^^^^^^^^^^^^^
File "/media/cher/brains/text-generation-webui/installer_files/env/lib/python3.11/site-packages/gradio/utils.py", line 649, in gen_wrapper
yield from f(*args, **kwargs)
File "/media/cher/brains/text-generation-webui/extensions/Training_PRO_wip/script.py", line 466, in check_dataset
loaded_JSONLdata = json.load(dataFile)
^^^^^^^^^^^^^^^^^^^
File "/media/cher/brains/text-generation-webui/installer_files/env/lib/python3.11/json/__init__.py", line 293, in load
return loads(fp.read(),
^^^^^^^^^^^^^^^^
File "/media/cher/brains/text-generation-webui/installer_files/env/lib/python3.11/json/__init__.py", line 346, in loads
return _default_decoder.decode(s)
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/media/cher/brains/text-generation-webui/installer_files/env/lib/python3.11/json/decoder.py", line 340, in decode
raise JSONDecodeError("Extra data", s, end)
json.decoder.JSONDecodeError: Extra data: line 2 column 1 (char 4268)

It looks like the extension loads every JSONL file with json.load, as if it were a single JSON document, so the second line will always cause this error. This seems to happen with every model. So far I've tried GPT2, mistral and lmsys_vicuna.
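
For anyone reading along, here is a minimal sketch of why the traceback ends in "Extra data" and what line-by-line parsing looks like. It uses only the standard library; the sample records and the `load_jsonl` helper are made up for illustration and are not taken from the Training_PRO source.

```python
import io
import json

# Two JSONL records, one JSON object per line (hypothetical sample data).
sample = (
    '{"messages": [{"role": "user", "content": "hi"}]}\n'
    '{"messages": [{"role": "user", "content": "bye"}]}\n'
)


def load_jsonl(fp):
    """Parse one JSON object per non-empty line of a file-like object."""
    return [json.loads(line) for line in fp if line.strip()]


# json.load expects exactly one JSON document, so everything after the
# first object is reported as "Extra data".
try:
    json.load(io.StringIO(sample))
    error = None
except json.JSONDecodeError as exc:
    error = exc.msg

# Parsing each line on its own reads both records.
records = load_jsonl(io.StringIO(sample))
```

With a real dataset the same helper would take an open file handle instead of `io.StringIO`.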

@carelesswhisp

Something that can take normal JSONL, like GPT uses, would be great: I could essentially transcribe a show and have the AI take on the personality of a character while keeping the full context of an episode. For example:

{"messages": [{"role": "user", "content": "text text text"}, {"role": "assistant", "content": "text text"}, {"role": "user", "content": "text text"},

@Sohex

Sohex commented Apr 28, 2024

JSONL 'works', but the extension expects it in a format that isn't actually valid JSONL. Wrap the whole file in an array (i.e. [ ]) and add a comma at the end of every line except the last, and it'll load.

To clarify, the correct format for jsonl looks like this:

{"messages": [{"role": "system", "content": "prompt"}, {"role": "user", "content": "input"}, {"role": "assistant", "content": "output"}]}
{"messages": [{"role": "system", "content": "prompt"}, {"role": "user", "content": "input"}, {"role": "assistant", "content": "output"}]}
{"messages": [{"role": "system", "content": "prompt"}, {"role": "user", "content": "input"}, {"role": "assistant", "content": "output"}]}
...

Whereas right now Training_PRO expects:

[
{"messages": [{"role": "system", "content": "prompt"}, {"role": "user", "content": "input"}, {"role": "assistant", "content": "output"}]},
{"messages": [{"role": "system", "content": "prompt"}, {"role": "user", "content": "input"}, {"role": "assistant", "content": "output"}]},
{"messages": [{"role": "system", "content": "prompt"}, {"role": "user", "content": "input"}, {"role": "assistant", "content": "output"}]},
...]
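
Rather than editing the file by hand, the wrapping can be scripted. This is a rough sketch of that workaround using only the standard library; `jsonl_to_json_array` is a hypothetical helper name, and the assumption is that Training_PRO accepts any valid JSON array of these objects, as described above.

```python
import json


def jsonl_to_json_array(jsonl_text: str) -> str:
    """Turn JSONL text (one object per line) into a single JSON array string."""
    records = [
        json.loads(line)
        for line in jsonl_text.splitlines()
        if line.strip()  # skip blank lines
    ]
    return json.dumps(records, ensure_ascii=False, indent=2)


# Hypothetical two-record dataset in proper JSONL form.
jsonl_text = (
    '{"messages": [{"role": "user", "content": "input"}, '
    '{"role": "assistant", "content": "output"}]}\n'
    '{"messages": [{"role": "user", "content": "input 2"}, '
    '{"role": "assistant", "content": "output 2"}]}\n'
)

converted = jsonl_to_json_array(jsonl_text)
```

To convert a real dataset, read the .jsonl file into a string, pass it through the function, and write the result to a new file for the extension to load.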

@carelesswhisp

carelesswhisp commented Apr 28, 2024

Ah, gotcha. Though I notice it now gives this error:
" raise TemplateError(message)
jinja2.exceptions.TemplateError: Conversation roles must alternate user/assistant/user/assistant/..."

Meaning it can't take something like a script and format it. Can we just modify this template?

Edit:
Just to explain: this isn't an issue with Training PRO (I think). It has to do with the chat template embedded in the tokenizer, so whatever model you're using determines the format it accepts. vicuna - v1.1 actually seems to work out of the box.
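
If modifying the template isn't an option, one possible workaround is to preprocess the data so the roles alternate before the template sees it. The sketch below merges consecutive messages from the same role into one. `merge_consecutive_roles` is a hypothetical helper, not part of Training_PRO or any tokenizer library, and it assumes the template only objects to back-to-back turns from the same role; some templates may have further rules (for example about system messages) that this does not handle.

```python
def merge_consecutive_roles(messages, separator="\n"):
    """Join back-to-back messages from the same role into a single message."""
    merged = []
    for message in messages:
        if merged and merged[-1]["role"] == message["role"]:
            # Same speaker as the previous turn: append to that turn.
            merged[-1]["content"] += separator + message["content"]
        else:
            # New speaker: copy the dict so the input list is not mutated.
            merged.append(dict(message))
    return merged


# Hypothetical script-style conversation with two user turns in a row.
script = [
    {"role": "user", "content": "line one"},
    {"role": "user", "content": "line two"},
    {"role": "assistant", "content": "reply"},
]

alternating = merge_consecutive_roles(script)
```

The trade-off is that turn boundaries within a merged message are only preserved by the separator, which may or may not suit a transcribed script.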
