jsonl broken, will only read as json #17

carelesswhisp · 2024-03-30T17:38:25Z

Any time I try to use a JSONL dataset, I get this error:
03:29:19-716909 INFO Loading JSONL datasets...
Traceback (most recent call last):
File "/media/cher/brains/text-generation-webui/installer_files/env/lib/python3.11/site-packages/gradio/queueing.py", line 407, in call_prediction
output = await route_utils.call_process_api(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/media/cher/brains/text-generation-webui/installer_files/env/lib/python3.11/site-packages/gradio/route_utils.py", line 226, in call_process_api
output = await app.get_blocks().process_api(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/media/cher/brains/text-generation-webui/installer_files/env/lib/python3.11/site-packages/gradio/blocks.py", line 1550, in process_api
result = await self.call_function(
^^^^^^^^^^^^^^^^^^^^^^^^^
File "/media/cher/brains/text-generation-webui/installer_files/env/lib/python3.11/site-packages/gradio/blocks.py", line 1199, in call_function
prediction = await utils.async_iteration(iterator)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/media/cher/brains/text-generation-webui/installer_files/env/lib/python3.11/site-packages/gradio/utils.py", line 519, in async_iteration
return await iterator.__anext__()
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/media/cher/brains/text-generation-webui/installer_files/env/lib/python3.11/site-packages/gradio/utils.py", line 512, in __anext__
return await anyio.to_thread.run_sync(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/media/cher/brains/text-generation-webui/installer_files/env/lib/python3.11/site-packages/anyio/to_thread.py", line 56, in run_sync
return await get_async_backend().run_sync_in_worker_thread(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/media/cher/brains/text-generation-webui/installer_files/env/lib/python3.11/site-packages/anyio/_backends/_asyncio.py", line 2134, in run_sync_in_worker_thread
return await future
^^^^^^^^^^^^
File "/media/cher/brains/text-generation-webui/installer_files/env/lib/python3.11/site-packages/anyio/_backends/_asyncio.py", line 851, in run
result = context.run(func, *args)
^^^^^^^^^^^^^^^^^^^^^^^^
File "/media/cher/brains/text-generation-webui/installer_files/env/lib/python3.11/site-packages/gradio/utils.py", line 495, in run_sync_iterator_async
return next(iterator)
^^^^^^^^^^^^^^
File "/media/cher/brains/text-generation-webui/installer_files/env/lib/python3.11/site-packages/gradio/utils.py", line 649, in gen_wrapper
yield from f(*args, **kwargs)
File "/media/cher/brains/text-generation-webui/extensions/Training_PRO_wip/script.py", line 466, in check_dataset
loaded_JSONLdata = json.load(dataFile)
^^^^^^^^^^^^^^^^^^^
File "/media/cher/brains/text-generation-webui/installer_files/env/lib/python3.11/json/__init__.py", line 293, in load
return loads(fp.read(),
^^^^^^^^^^^^^^^^
File "/media/cher/brains/text-generation-webui/installer_files/env/lib/python3.11/json/__init__.py", line 346, in loads
return _default_decoder.decode(s)
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/media/cher/brains/text-generation-webui/installer_files/env/lib/python3.11/json/decoder.py", line 340, in decode
raise JSONDecodeError("Extra data", s, end)
json.decoder.JSONDecodeError: Extra data: line 2 column 1 (char 4268)

It looks like it's loading every JSONL file as plain JSON, so the second line will always cause an error. This seems to happen with every model. So far I've tried GPT2, mistral and lmsys_vicuna.
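To illustrate the failure, here is a minimal sketch of why `json.load` raises "Extra data" on a JSONL file and how a line-by-line reader avoids it. The helper name `load_jsonl` and the sample data are made up for illustration; this is not the code in Training_PRO's `script.py`.

```python
import io
import json


def load_jsonl(file_obj):
    """Parse one JSON value per non-empty line (hypothetical helper)."""
    records = []
    for line_number, line in enumerate(file_obj, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        try:
            records.append(json.loads(line))
        except json.JSONDecodeError as exc:
            # Report which line of the dataset is malformed
            raise ValueError(f"line {line_number}: {exc}") from exc
    return records


# Two records, one per line, as in a JSONL dataset (made-up sample)
sample = '{"a": 1}\n{"a": 2}\n'

# json.load parses the first object, then finds more text and gives up
try:
    json.load(io.StringIO(sample))
    whole_file_error = None
except json.JSONDecodeError as exc:
    whole_file_error = exc.msg

print("json.load:", whole_file_error)
print("line by line:", load_jsonl(io.StringIO(sample)))
```

Reading per line also means a malformed record can be reported with its line number instead of a character offset into the whole file.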

carelesswhisp · 2024-03-30T17:53:17Z

Something that can take normal JSONL, like GPT does, would be great. Then I could essentially transcribe a show and have the AI take on the personality of a character while keeping the full context of an episode. For example:

{"messages": [{"role": "user", "content": "text text text"}, {"role": "assistant", "content": "text text"}, {"role": "user", "content": "text text"},

Sohex · 2024-04-28T06:29:13Z

JSONL 'works', but the extension needs it to be formatted incorrectly. Wrap the whole thing in an array (i.e. in []) and add a comma at the end of every line except the last, and it'll work.

To clarify, the correct format for jsonl looks like this:

{"messages": [{"role": "system", "content": "prompt"}, {"role": "user", "content": "input"}, {"role": "assistant", "content": "output"}]}
{"messages": [{"role": "system", "content": "prompt"}, {"role": "user", "content": "input"}, {"role": "assistant", "content": "output"}]}
{"messages": [{"role": "system", "content": "prompt"}, {"role": "user", "content": "input"}, {"role": "assistant", "content": "output"}]}
...

Whereas right now Training_PRO expects:

[
{"messages": [{"role": "system", "content": "prompt"}, {"role": "user", "content": "input"}, {"role": "assistant", "content": "output"}]},
{"messages": [{"role": "system", "content": "prompt"}, {"role": "user", "content": "input"}, {"role": "assistant", "content": "output"}]},
{"messages": [{"role": "system", "content": "prompt"}, {"role": "user", "content": "input"}, {"role": "assistant", "content": "output"}]},
...]
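If editing the file by hand is tedious, the workaround above could be scripted roughly like this. It is only a sketch: `jsonl_to_json_array` is a hypothetical helper name, and it assumes each non-empty line of the input is a complete JSON object.

```python
import json


def jsonl_to_json_array(jsonl_text):
    """Turn JSONL text into a single JSON array string (hypothetical helper)."""
    records = [
        json.loads(line)
        for line in jsonl_text.splitlines()
        if line.strip()  # skip blank lines
    ]
    # One top-level array is what json.load can read in a single call
    return json.dumps(records, indent=2, ensure_ascii=False)


# Made-up two-record dataset in proper JSONL form
jsonl_text = (
    '{"messages": [{"role": "user", "content": "input"}, '
    '{"role": "assistant", "content": "output"}]}\n'
    '{"messages": [{"role": "user", "content": "input 2"}, '
    '{"role": "assistant", "content": "output 2"}]}\n'
)

array_text = jsonl_to_json_array(jsonl_text)
print(array_text)
```

Writing `array_text` to a new file leaves the original JSONL untouched, so the same dataset can still be used with tools that expect real JSONL.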

carelesswhisp · 2024-04-28T21:22:09Z

Ah, gotcha. Though I notice it now gives this error:
" raise TemplateError(message)
jinja2.exceptions.TemplateError: Conversation roles must alternate user/assistant/user/assistant/..."

Meaning it can't take something like a script and format it. Can we just modify this template?

Edit:
Just to explain: this isn't an issue with Training PRO (I think). It has to do with the chat template embedded in the tokenizer, so whichever model you're using determines the format. vicuna - v1.1 actually seems to work out of the box.
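For templates that do enforce alternation, a quick pre-check of the dataset can show which messages would trip the error before training starts. This is a rough sketch in plain Python, not the template's actual logic: `find_role_violations` is a made-up name, and it assumes the template tolerates one optional leading system message and then expects user, assistant, user, and so on.

```python
def find_role_violations(messages):
    """Return indexes of messages that break strict user/assistant alternation.

    Assumption: an optional system message may come first; after that the
    roles must go user, assistant, user, assistant, ...
    """
    has_system = bool(messages) and messages[0]["role"] == "system"
    turns = messages[1:] if has_system else messages
    offset = 1 if has_system else 0
    expected_roles = ("user", "assistant")
    return [
        position + offset
        for position, message in enumerate(turns)
        if message["role"] != expected_roles[position % 2]
    ]


# Made-up script-style conversation with two user lines in a row
script_like = [
    {"role": "user", "content": "text"},
    {"role": "assistant", "content": "text"},
    {"role": "user", "content": "text"},
    {"role": "user", "content": "text"},
]

print(find_role_violations(script_like))
```

An empty result means the conversation alternates as assumed. Any reported index points at a message that would need to be merged with its neighbour or re-labelled for a strict template.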
