
Loading the dediacritic tool fails due to an Emoji library dependency, and tokenizer model_max_length seems incorrect #5

Open
ghost opened this issue Sep 8, 2022 · 0 comments


ghost commented Sep 8, 2022

Hello,

  1. The dediacritic tool doesn't seem to work in Google Colab with Python 3.7. I tried to manually modify the Emoji library, but to no avail. My setup is below; a possible workaround is sketched after the screenshot.
import os

from google.colab import output
output.enable_custom_widget_manager()

# Mount Google Drive so the CAMeL Tools data lives in a persistent folder.
from google.colab import drive
drive.mount('/content/drive/')

# Install CAMeL Tools and download its datasets into the mounted folder.
!pip install camel-tools==1.4.1 -f https://download.pytorch.org/whl/torch_stable.html
os.environ['CAMELTOOLS_DATA'] = '/content/drive/MyDrive/SAAL/EnAr/CAMeL'
!camel_data -i all

from camel_tools.utils.dediac import dediac_ar

[screenshot of the error traceback]
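If the failure is the commonly reported incompatibility between camel-tools and emoji 2.x (which removed constants that older downstream packages rely on), pinning an earlier emoji release may help. This is a minimal sketch of that workaround, assuming the traceback really does point at the emoji package; the exact version pin is an assumption, not a confirmed fix.

```python
# Hypothetical workaround: pin emoji to a pre-2.0 release before importing
# camel_tools (run in Colab before the camel-tools imports above).
!pip install "emoji<2.0.0"

# Restart the Colab runtime so the downgraded package is picked up, then:
from camel_tools.utils.dediac import dediac_ar
print(dediac_ar('كَتَبَ'))  # expected: 'كتب', with the diacritics stripped
```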

  2. After loading the tokenizer with Hugging Face's AutoTokenizer, I have to set tokenizer.model_max_length manually to 512; otherwise the value is an extremely large integer (> 1e10). A sketch of the manual fix is below.
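A minimal sketch of that manual cap, assuming a BERT-style checkpoint with a 512-token limit; the checkpoint name is a placeholder, since the issue doesn't say which model is being loaded. When no usable maximum length is stored in the tokenizer config, transformers falls back to a very large sentinel value, which matches the huge integer described above.

```python
from transformers import AutoTokenizer

# Placeholder checkpoint; substitute the model actually used in this project.
tokenizer = AutoTokenizer.from_pretrained("CAMeL-Lab/bert-base-arabic-camelbert-mix")

# If the tokenizer config carries no usable max length, transformers reports a
# huge sentinel value, so cap it by hand before encoding long inputs.
if tokenizer.model_max_length > 512:
    tokenizer.model_max_length = 512

enc = tokenizer("مثال", truncation=True, max_length=tokenizer.model_max_length)
print(tokenizer.model_max_length)  # 512
```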