[BUG] Token Classification fails on value error text_column #813

Open
2 tasks done
jmccrae opened this issue Nov 26, 2024 · 3 comments
Labels
bug Something isn't working

Comments

jmccrae commented Nov 26, 2024

Prerequisites

  • I have read the documentation.
  • I have checked other issues for similar problems.

Backend

Local

Interface Used

CLI

CLI Command

from autotrain.params import TokenClassificationParams
from autotrain.project import AutoTrainProject


params = TokenClassificationParams(
    model="FacebookAI/roberta-base",
    data_path="data",
)

backend = "local"
project = AutoTrainProject(params=params, backend=backend, process=True)
project.create()

UI Screenshots & Parameters

No response

Error Logs

Traceback (most recent call last):
  File "/home/jmccrae/scratch/wikilinks_autotrain/apply_autotrain.py", line 13, in <module>
    project.create()
  File "/home/jmccrae/.cache/pypoetry/virtualenvs/wikilinks-autotrain-NvKt9JsM-py3.12/lib/python3.12/site-packages/autotrain/project.py", line 567, in create
    self.params = self._process_params_data()
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jmccrae/.cache/pypoetry/virtualenvs/wikilinks-autotrain-NvKt9JsM-py3.12/lib/python3.12/site-packages/autotrain/project.py", line 559, in _process_params_data
    return token_clf_munge_data(self.params, self.local)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jmccrae/.cache/pypoetry/virtualenvs/wikilinks-autotrain-NvKt9JsM-py3.12/lib/python3.12/site-packages/autotrain/project.py", line 265, in token_clf_munge_data
    params.text_column = "autotrain_text"
    ^^^^^^^^^^^^^^^^^^
  File "/home/jmccrae/.cache/pypoetry/virtualenvs/wikilinks-autotrain-NvKt9JsM-py3.12/lib/python3.12/site-packages/pydantic/main.py", line 884, in __setattr__
    raise ValueError(f'"{self.__class__.__name__}" object has no field "{name}"')
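The last frame shows pydantic's `__setattr__` rejecting an assignment to a field that the params model does not declare. A minimal stdlib-only sketch of that behaviour, for illustration: `StrictParams` is a hypothetical stand-in, not pydantic's or AutoTrain's actual class, and only the error message format is taken from the traceback above.

```python
class StrictParams:
    """Hypothetical stand-in that rejects assignment to undeclared fields."""

    # Illustrative field list; the real params class declares many more fields.
    _fields = ("model", "data_path", "tokens_column", "tags_column")

    def __init__(self, **values):
        for name, value in values.items():
            setattr(self, name, value)

    def __setattr__(self, name, value):
        if name not in self._fields:
            # Same message format as the raise statement in the traceback.
            raise ValueError(f'"{self.__class__.__name__}" object has no field "{name}"')
        super().__setattr__(name, value)


params = StrictParams(model="FacebookAI/roberta-base", data_path="data")
try:
    # Mirrors the failing assignment in token_clf_munge_data.
    params.text_column = "autotrain_text"
    error_message = None
except ValueError as exc:
    error_message = str(exc)
print(error_message)
```

Under this reading, the assignment fails because the token classification params declare `tokens_column` and `tags_column` but no `text_column`.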

Additional Information

Data is formatted as in https://huggingface.co/docs/autotrain/en/tasks/token_classification

I also tried commenting out the offending lines, but then ran into this error:

  File "/home/jmccrae/.cache/pypoetry/virtualenvs/wikilinks-autotrain-NvKt9JsM-py3.12/lib/python3.12/site-packages/autotrain/trainers/common.py", line 212, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/jmccrae/.cache/pypoetry/virtualenvs/wikilinks-autotrain-NvKt9JsM-py3.12/lib/python3.12/site-packages/autotrain/trainers/token_classification/__main__.py", line 89, in train
    label_list = train_data.features[config.tags_column].feature.names
                 ~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^
KeyError: 'tags'

jmccrae added the bug label on Nov 26, 2024

@abhishekkrthakur (Member)

Could you print the column names in your dataset and the output of print(params)?
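One way to answer the first half of that request without loading the full dataset is to read only the CSV header. A rough sketch using the standard library: `csv_columns` and `missing_columns` are hypothetical helper names, and the demo writes a throwaway file rather than touching the real `data/train.csv`.

```python
import csv
import os
import tempfile


def csv_columns(path):
    """Return the header row of a CSV file as a list of column names."""
    with open(path, newline="") as f:
        return next(csv.reader(f))


def missing_columns(path, expected):
    """Return the expected column names that are absent from the CSV header."""
    present = csv_columns(path)
    return [name for name in expected if name not in present]


# Demo on a temporary file with the two columns the params refer to.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "train.csv")
    with open(demo_path, "w", newline="") as f:
        f.write("tokens,tags\n")
    columns = csv_columns(demo_path)
    missing = missing_columns(demo_path, ["tokens", "tags"])

print(columns, missing)
```

Pointing `csv_columns` at the actual training file shows exactly which names the loader will see, including any stray whitespace.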


jmccrae commented Nov 27, 2024

You can run the reproduction here: https://colab.research.google.com/drive/1shka-nlusipnN6TTAlQPhcXhrvgehNF8?usp=sharing

This is the output of print(params):

{'data_path': 'data', 'model': 'FacebookAI/roberta-base', 'lr': 5e-05, 'epochs': 3, 
'max_seq_length': 128, 'batch_size': 8, 'warmup_ratio': 0.1, 
'gradient_accumulation': 1, 'optimizer': 'adamw_torch', 'scheduler': 'linear', 
'weight_decay': 0.0, 'max_grad_norm': 1.0, 'seed': 42, 'train_split': 'train',
 'valid_split': None, 'tokens_column': 'tokens', 'tags_column': 'tags',
 'logging_steps': -1, 'project_name': 'project-name', 
'auto_find_batch_size': False, 'mixed_precision': None, 'save_total_limit': 1,
 'token': None, 'push_to_hub': False, 'eval_strategy': 'epoch', 'username': None,
 'log': 'none', 'early_stopping_patience': 5, 'early_stopping_threshold': 0.01}

Also, I noticed that the CSV on this page is broken: there is a space after the comma, which breaks CSV parsing.
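To illustrate why a space after the comma matters, here is a small sketch with Python's `csv` module. The sample lines are an assumption (the row layout used in this thread, with a space added after each comma), not a copy of the documentation page, and other CSV readers may behave differently.

```python
import csv

# Assumed sample: header and one row with a space after the separating comma.
broken_lines = [
    "tokens, tags",
    "\"['I', 'love', 'Paris']\", \"['O', 'O', 'B-LOC']\"",
]

# Default dialect: the space starts an unquoted field, so the quote that
# follows is kept as a literal character and the inner commas split the cell.
default_rows = list(csv.reader(broken_lines))

# skipinitialspace=True ignores the space, so the quoted cell stays intact.
lenient_rows = list(csv.reader(broken_lines, skipinitialspace=True))

print(default_rows[0])
print(len(default_rows[1]), len(lenient_rows[1]))
```

With the default settings the second header name keeps its leading space, so a lookup by the name `tags` would not find it. That may be related to the `KeyError: 'tags'` above, though the thread does not confirm it.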

@abhishekkrthakur (Member)

Fixed. Upgrade with:

pip install -U autotrain-advanced

Code:

import os

from autotrain.params import TokenClassificationParams
from autotrain.project import AutoTrainProject


# Create the data directory if it does not already exist.
os.makedirs("data", exist_ok=True)

with open("data/train.csv", "w") as f:
  print("tokens,tags", file=f)
  print("\"['I', 'love', 'Paris']\",\"['O', 'O', 'B-LOC']\"", file=f)
  print("\"['I', 'live', 'in', 'New', 'York']\",\"['O', 'O', 'O', 'B-LOC', 'I-LOC']\"", file=f)

with open("data/valid.csv", "w") as f:
  print("tokens,tags", file=f)
  print("\"['I', 'love', 'Paris']\",\"['O', 'O', 'B-LOC']\"", file=f)
  print("\"['I', 'live', 'in', 'New', 'York']\",\"['O', 'O', 'O', 'B-LOC', 'I-LOC']\"", file=f)


params = TokenClassificationParams(
    model="FacebookAI/roberta-base",
    data_path="data")

backend = "local"
project = AutoTrainProject(params=params, backend=backend, process=True)
project.create()

Note: I've changed the test filename to valid; otherwise, you need to specify valid_split in params.

Apologies for the inconvenience.
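Each cell in the CSV above holds a Python list written out as a string. As a sanity check before training, the rows can be parsed back and the token and tag counts compared. This is a minimal sketch, with `parse_rows` as a hypothetical helper; it is not necessarily how AutoTrain itself reads the file.

```python
import ast
import csv


def parse_rows(lines, tokens_column="tokens", tags_column="tags"):
    """Parse CSV lines whose cells hold stringified Python lists."""
    rows = []
    for record in csv.DictReader(lines):
        # literal_eval only evaluates Python literals, unlike eval.
        tokens = ast.literal_eval(record[tokens_column])
        tags = ast.literal_eval(record[tags_column])
        if len(tokens) != len(tags):
            raise ValueError(f"{len(tokens)} tokens but {len(tags)} tags: {tokens}")
        rows.append((tokens, tags))
    return rows


# Same content as the train.csv written in the code above.
sample = [
    "tokens,tags",
    "\"['I', 'love', 'Paris']\",\"['O', 'O', 'B-LOC']\"",
    "\"['I', 'live', 'in', 'New', 'York']\",\"['O', 'O', 'O', 'B-LOC', 'I-LOC']\"",
]
rows = parse_rows(sample)
print(rows)
```

To check a real file, pass an open file object instead of `sample`, for example `parse_rows(open("data/train.csv", newline=""))`.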
