
Error with coref #45

Open
idan-h opened this issue Jan 30, 2024 · 25 comments

@idan-h

idan-h commented Jan 30, 2024

I'm trying to run coreference resolution with
python heb_pipe.py -c example_in.txt

but I get this:

# text = עיפרון עיפרון הוא כלי כתיבה ידני לשם כתיבה וציור, לרוב על דפי נייר. העיפרון מורכב ממוט גרפיט, אשר לרוב מצופה בעץ. רכות הגרפיט מובילה לכך שבהשתפשפו בנייר הוא משאיר עליו פירורים זערוריים המהווים את רישום העיפרון. העיפרון הצבעוני מכיל פיגמנט. העיפרון נבדל ממרבית כלי הכתיבה (כמו למשל עטים, צבעי פנדה) בכך שניתן למחוק את תוצריו. לעיתים קרובות נמצא בקצהו האחד של העיפרון מחק. בעיפרון ממוצע ניתן לכתוב כ־50,000 מילים לפני שהוא נגמר. במהלך השימוש בעיפרון נהוג לחדדו באמצעות מחדד.
1	עיפרון	_	_	_	_	0	_
2	עיפרון הוא כלי כתיבה ידני לשם כתיבה וציור, לרוב על דפי נייר. העיפרון מורכב ממוט גרפיט, אשר לרוב מצופה בעץ. רכות הגרפיט מובילה לכך שבהשתפשפו בנייר הוא משאיר עליו פירורים זערוריים המהווים את רישום העיפרון. העיפרון הצבעוני מכיל פיגמנט. העיפרון נבדל ממרבית כלי הכתיבה (כמו למשל עטים, צבעי פנדה) בכך שניתן למחוק את תוצריו. לעיתים קרובות נמצא בקצהו האחד של העיפרון מחק. בעיפרון ממוצע ניתן לכתוב כ־50,000 מילים לפני שהוא נגמר. במהלך השימוש בעיפרון נהוג לחדדו באמצעות מחדד.	_	_	_	_	0	_

which I guess does not mean much

@amir-zeldes
Owner

Hi @idan-h - it looks like you are running the system on unsegmented text, but you are not asking for segmentation, so it assumes the text is already segmented. As a result it just treats the whole thing as one giant word and there is nothing to do coref on.

Since the text is unanalyzed, can you try running it with the full pipeline like this?

python heb_pipe.py -wtpldec example_in.txt

@idan-h
Author

idan-h commented Jan 30, 2024


HebPipe\hebpipe>python heb_pipe.py -wtpldec example_in.txt

Running tasks:
====================
o Automatic sentence splitting (neural)
o Whitespace tokenization
o Morphological segmentation
o POS and Morphological tagging
o Lemmatization
o Dependency parsing
o Entity recognition
o Coreference resolution

Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.7.0.json: 370kB [00:00, 23.7MB/s]
2024-01-30 03:26:13 WARNING: GPU requested, but is not available!
Some weights of BertModel were not initialized from the model checkpoint at onlplab/alephbert-base and are newly initialized: ['bert.pooler.dense.weight', 'bert.pooler.dense.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Some weights of BertModel were not initialized from the model checkpoint at onlplab/alephbert-base and are newly initialized: ['bert.pooler.dense.weight', 'bert.pooler.dense.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Using bos_token, but it is not set yet.
Using eos_token, but it is not set yet.
Processing example_in.txt
Traceback (most recent call last):
  File "heb_pipe.py", line 851, in <module>
    run_hebpipe()
  File "heb_pipe.py", line 828, in run_hebpipe
    processed = nlp(input_text, do_whitespace=opts.whitespace, do_tok=dotok, do_tag=opts.posmorph, do_lemma=opts.lemma,
  File "heb_pipe.py", line 604, in nlp
    tokenized = rf_tok.rf_tokenize(data.strip().split("\n"))
  File "venv\lib\site-packages\rftokenizer\tokenize_rf.py", line 924, in rf_tokenize
    self.load()
  File "venv\lib\site-packages\rftokenizer\tokenize_rf.py", line 540, in load
    self.bert = FlairTagger(seg=True)
  File "venv\lib\site-packages\rftokenizer\flair_pos_tagger.py", line 45, in __init__
    self.model = SequenceTagger.load(model_dir + lang_prefix + ".seg")
  File "venv\lib\site-packages\flair\nn.py", line 88, in load
    state = torch.load(f, map_location='cpu')
  File "venv\lib\site-packages\torch\serialization.py", line 577, in load
    with _open_zipfile_reader(opened_file) as opened_zipfile:
  File "venv\lib\site-packages\torch\serialization.py", line 241, in __init__
    super(_open_zipfile_reader, self).__init__(torch._C.PyTorchFileReader(name_or_buffer))
RuntimeError: [enforce fail at ..\caffe2\serialize\inline_container.cc:144] . PytorchStreamReader failed reading zip archive: failed finding central directory
Elapsed time: 0:00:16.171
========================================

@amir-zeldes
Owner

Hm, it looks like the conflict is Stanza. Stanza 1.7.0 is pretty new, so maybe downgrading will solve it. At any rate I can confirm that Stanza 1.1.0 works, so that's worth a try.

@amir-zeldes
Owner

Oh wait, scratch that, I misread it - actually it looks like the sequence tagger for the segmenter has a corrupt model file. Can you delete heb.seg and redownload it from here?

https://gucorpling.org/amir/download/heb_models_v3/heb.seg

@idan-h
Author

idan-h commented Jan 30, 2024


I can't seem to find heb.seg

@amir-zeldes
Owner

Since you're in a venv you can also wipe it out and start a new one, but I would assume you'll find it in:

venv\lib\site-packages\hebpipe\models\

I'm not sure how it got corrupted other than maybe a bad connection, but based on the error message it looks like the model file you have is an incomplete archive, see:

https://stackoverflow.com/questions/71617570/pytorchstreamreader-failed-reading-zip-archive-failed-finding-central-directory
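
As a quick check before re-downloading, you could verify whether the file on disk is a complete zip archive (PyTorch model files in the newer serialization format are zip files); a truncated download fails this test with exactly that "central directory" symptom. This is just a standalone sketch, and the path below is only where the models usually land - adjust it to your install:

```python
# Quick sanity check (not part of HebPipe): is the downloaded model a complete
# zip archive? A truncated download returns False here, matching the
# "failed finding central directory" error in the traceback.
import zipfile

path = r"venv\lib\site-packages\hebpipe\models\heb.seg"  # adjust to your install
print("complete archive:", zipfile.is_zipfile(path))
```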

@idan-h
Author

idan-h commented Jan 30, 2024


I cloned the repo.
Deleted all the models and redownloaded. Deleted %userprofile%\.cache\torch and redownloaded.

Still happens.

Where is this file that needs to be deleted? ><

Redoing the installation would be a nightmare. I think this needs to be a Docker container; there is no other option.

Btw, I tried a fresh installation with pip, and it still happens.

@amir-zeldes
Owner

I'm not sure what you mean about docker, we didn't include one - or did someone set up a container for it?

In any case, the model should be downloaded automatically by the software the first time it attempts to run segmentation - it should get downloaded into wherever lib/site-packages is for the python in question (under venv if it's a venv). You can see the line of code that downloads it in the RFTokenizer dependency here so you can possibly try debugging it in an IDE and see why the download won't complete correctly on your connection:

https://github.com/amir-zeldes/RFTokenizer/blob/master/rftokenizer/flair_pos_tagger.py#L43

Actually it may make sense to just pip install rftokenizer first (that's just the segmenter as a standalone library) and test that in isolation, or follow the instructions in the repo here: https://github.com/amir-zeldes/RFTokenizer

Does that library work? If so, it should fetch heb.seg for itself and hebpipe should be able to use it as well if it's been installed.
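
For example, something along these lines should exercise the segmenter on its own (a rough sketch - the class and constructor details are my best guess from the RFTokenizer repo and may differ in your installed version; load() and rf_tokenize() are the calls visible in your traceback):

```python
# Rough sketch for testing the segmenter in isolation after `pip install rftokenizer`.
# Constructor details are a guess and may differ slightly from the installed version;
# load() and rf_tokenize() are the methods that appear in the traceback above.
from rftokenizer import RFTokenizer

tok = RFTokenizer(model="heb")  # "heb" selects the Hebrew model (assumed keyword name)
tok.load()                      # this is the step that reads heb.seg and raised the
                                # PytorchStreamReader error earlier

# rf_tokenize() takes a list of whitespace tokens, one string per token
print(tok.rf_tokenize("עיפרון הוא כלי כתיבה".split()))
```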

@idan-h
Author

idan-h commented Jan 31, 2024

Debugging was a good idea; I found the file at
venv\Lib\site-packages\rftokenizer\models\

I mean that a Docker container is a must here - it would solve all of the dependency issues.

Processing example_in.txt
Traceback (most recent call last):
  File "/hebpipe/heb_pipe.py", line 851, in <module>
    run_hebpipe()
  File "/hebpipe/heb_pipe.py", line 828, in run_hebpipe
    processed = nlp(input_text, do_whitespace=opts.whitespace, do_tok=dotok, do_tag=opts.posmorph, do_lemma=opts.lemma,
  File "/hebpipe/heb_pipe.py", line 636, in nlp
    lemmas = lemmatize(lemmatizer, zero_conllu, morphs)
  File "/hebpipe/heb_pipe.py", line 478, in lemmatize
    tok["id"] = int(tok["id"][0])
TypeError: list indices must be integers or slices, not str
Elapsed time: 0:00:25.953
========================================

I get this baby now

@amir-zeldes
Owner

Hm, OK, from just this error message it's hard for me to know whether it's failing because of a version incompatibility (e.g. some version of stanza doesn't call the token id tok["id"]) or because an upstream module failed (e.g. the tokenizer never ran properly, so the lemmatizer is being fed something wrong). Can you try running venv\Lib\site-packages\rftokenizer\tokenize_rf.py -m heb on your text file to verify that it actually outputs segmented Hebrew? If so, then the problem is probably a stanza version issue; if not, RFTokenizer probably wasn't installed successfully / the tokenization model is broken.

@idan-h
Author

idan-h commented Jan 31, 2024


עיפרון
עיפרון הוא כלי כתיבה ידני לשם כתיבה וציור, לרוב על דפי נייר. העיפרון מורכב ממוט גרפיט, אשר לרוב מצופה בעץ. רכות הגרפיט מובילה לכך שבהשתפשפו בנייר הוא משאיר עליו פירורים זערוריים המהווים את רישום העיפרון. העיפרון הצבעוני מכיל פיגמנט. העיפרון נבדל ממרבית כלי הכתיבה (כמו למשל עטים, צבעי פנדה) בכך שניתן למחוק את תוצריו. לעיתים קרובות נמצא בקצהו האחד של העיפרון מחק. בעיפרון ממוצע ניתן לכתוב כ־50,000 מילים לפני שהוא נגמר. במהלך השימוש בעיפרון נהוג לחדדו באמצעות מחדד.

This is the output of
python ..\venv\Lib\site-packages\rftokenizer\tokenize_rf.py -m heb example_in.txt > tokenizer_output

also, a warning:

..\venv\Lib\site-packages\rftokenizer\tokenize_rf.py:218: DeprecationWarning: In a future version, `df.iloc[:, i] = newvals` will attempt to set the values inplace instead of always setting a new array. To retain the old behavior, use either `df[df.columns[i]] = newvals` or, if columns are non-unique, `df.isetitem(i, newvals)`
  dframe.loc[:, column] = all_encoders_[idx].transform(dframe.loc[:, column].values)

@amir-zeldes
Owner

Oh, whoops, right - rf_tokenize expects the data to already be whitespace-tokenized in its input; that's why you're not getting anything meaningful. Its input would be a file like:

עיפרון
הוא
כלי
כתיבה
ידני
לשם
כתיבה
וציור

But the fact that it didn't crash suggests it's installed correctly (the warning is not an issue). So I would guess it's a stanza version thing, since it's underspecified in requirements. What version do you have? If it's 1.7.0 because it simply got the latest, can you try 1.1.0?
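
For reference, a minimal standalone sketch (not HebPipe code; the file names are just examples) that turns a raw text file into that one-token-per-line input:

```python
# Standalone sketch (not part of HebPipe): write raw text out as one
# whitespace token per line, the input format tokenize_rf.py expects.
# HebPipe's whitespace tokenization step does the same thing internally.
with open("example_in.txt", encoding="utf8") as f:
    text = f.read()

with open("example_in.tok", "w", encoding="utf8") as out:
    for token in text.split():  # naive whitespace splitting
        out.write(token + "\n")
```

Running python ..\venv\Lib\site-packages\rftokenizer\tokenize_rf.py -m heb example_in.tok on that file should then show actual segmentations.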

@maayanorner

maayanorner commented Feb 1, 2024

@amir-zeldes @idan-h
Hi folks,

I had the same issue.
1.1.0 doesn't exist according to pip.

pip install stanza==1.5.0
Fixed it.
Excited to see it working :)
The full pipeline still doesn't work, but the coref works with:
conda create -n hebpipe python=3.8
pip install hebpipe

py_38_requirements.txt:
scikit-learn==0.23.2
joblib==1.3.2
numpy==1.21.0
pandas==1.5.3
xgboost==0.81
hyperopt==0.2.4
flair==0.6.1
transformers==3.5.1
torch==1.6.0
gensim==3.8.3
diaparser==1.1.2
stanza==1.5.0

But the dependencies must be installed with:

pip install --use-deprecated=legacy-resolver -r py_38_requirements.txt

@amir-zeldes
Owner

I see, thanks for posting that information! Well, we could try to patch this together somehow using the existing models, but at this point I think the better path would be to retrain all of the models for the latest torch/stanza etc.

I'll see if I can get this to run - the MTL module doesn't seem to play nicely with torch 2.x (I think?) but it might be possible to get around this. If I can get training to run under a more recent version I'll post exact version requirements (I guess I was a bit lazy developing mainly for a paper deadline...)

@maayanorner


Yes, it doesn't play nicely with torch 2.x, and for some reason the requirements contradict each other (things change and packages get installed in a particular order, so it's tricky); it also doesn't work with the newer Stanza versions, but not with older ones either (breaking changes :P). If I figure out how to run the pipeline as a whole, I will document the issues I encounter along the way. Anyway, I just wanted to say that it's 100% understandable, and it's highly appreciated and not taken for granted that you continue maintaining a research project with so many moving parts, so thank you!

Cheers :)

@idan-h
Author

idan-h commented Feb 1, 2024

@amir-zeldes once this is figured out, I will make a Docker container so it works out of the box

@amir-zeldes
Owner

Thank you both, I appreciate it! Could you try installing this branch:

https://github.com/amir-zeldes/HebPipe/tree/no-mtl-compat

It should download its own models using torch 2.1, but please use a venv - I'm a little worried Stanza and other libraries have a tendency to install files to places like <USER>/stanza_resources/ and then even when you install totally different versions and try to encapsulate them, older models leak in and prevent things from running. I think the branch above should work, at least on a clean install, and plays nicely with Stanza 1.7.0, at least for me.

Let me know how it works and once it's all smoothed out I'm happy to merge and push to PyPI for easier installing.

PS - due to a compat issue I had to downgrade the transformer POS tagger, so that may be slightly less accurate for now. Segmentation and parsing still use AlephBERT though, so no accuracy hit there (actually scoring higher, probably a fluke).

@maayanorner


I will look into that soon (I use your work as a baseline, so I will likely get to it pretty quickly; I currently have tight deadlines from many different directions).

@maayanorner

maayanorner commented Oct 21, 2024

Hi Amir,

I have trained an S2E model for the task.
https://github.com/IAHLT/iahlt_coref_he

Generally speaking, I think we use the same (linguistic) tokenization (I use a custom UDPipe model with the IAHLT scheme, but RFT would be even better), so perhaps it would be relatively simple to integrate this model into HebPipe (you can play with it a bit to see if it is good enough)?

https://github.com/IAHLT/iahlt_coref_he/blob/main/iahlt_coref_he/coref_pipeline.py
The class I implemented (well, I took code from a few other repositories as well) is provided here; it is pretty simple to use and modular. The code is very beta, but it is simple enough that it should probably be alright.

What do you think? Is that relevant?

@idan-h
Author

idan-h commented Oct 21, 2024


Hey, amazing work!
Just curious, why don't you upload it to Hugging Face?

@maayanorner

maayanorner commented Oct 21, 2024


That's a good question.
I am not experienced with uploading projects of this type to HF, i.e., I do not know how to write a wrapper for that (usually I upload simpler projects that are already HF classes). Also, there are some dependencies that can make it tricky.

I will look into it, good idea.

@amir-zeldes
Owner

That's very cool, thanks for offering to contribute! If it's working well and easy to integrate with the current dependencies then I don't see a reason not to try it. The current coref facilities in HebPipe are very basic, so they wouldn't be hard to beat! Does the model support singletons (i.e. all mentions appear in the output, even if only mentioned once), or does it only output coreferring chains of length 2+?

@maayanorner

Hi Amir,

The model does not support singletons; the dataset it was trained on excluded singletons as well.
There are a few things that can be improved. I will attach two random examples here:

[מייקל ג'פרי ג'ורדן]_0 ( ב אנגלית : [Michael Jeffrey Jordan]_0 ; נולד ב - 17 ב פברואר 1963 ) הוא כדורסלן עבר אמריקאי , ש שיחק ב עמדת הקלע . [ג'ורדן]_0 , אשר ניצב ב מקום ה חמישי ב רשימת ה קלעים ה מובילים ב תולדות ה - NBA , נחשב ב עינ י רבים לכדורסלן ה טוב ביותר ב היסטוריה , [ 1] [ 2 ] [ 3 ] ו זכה ל הצלחה רבה ש סייעה לפרסומה של [ליגת ה - NBA]_1 ב רחבי ה עולם ב שנות ה - 80 ו ה - 90 של ה מאה ה - 20 . הצטרף ל [שיקגו בולס]_2 ב עונת 1984/1985 , ו תוך זמן קצר הפך ל אחד ה כוכבים ה בולטים ב [ליגת ה - NBA]_1 . יכולותי [ו]_0 ככדורסלן וכאתלט זיכו אות [ו]_0 ב כינויים " אייר ג'ורדן " , " הוד אוויריות [ו]_0 " , " ישו ה שחור " ו - " אלוהים " . במהלך ה קריירה [ג'ורדן]_0 זכה ב מגוון תארים שונים , בין ה שאר , שש פעמים ב אליפות ה - NBA ב מדי [שיקגו בולס]_2 , ו פעמיים ב מדליית זהב אולימפית ב מדי נבחרת ארצות ה ברית . הישגי [ו]_0 ה אישיים כוללים חמש זכיות ב תואר MVP של ה עונה ה סדירה , עשר פעמים חבר בחמישיית ה עונה , תשע פעמים חבר בחמישיית ה הגנה , 14 השתתפויות ב משחק האולסטאר , שלוש זכיות ב תואר ה - MVP של משחק האולסטאר , שלוש זכיות ב תואר מלך ה חטיפות , שש זכיות ב תואר ה - MVP של סדרת ה גמר , ו זכייה ב תואר שחקן ה הגנה של ה עונה ב - 1988 . [ 4 ] ב נוסף , מחזיק [ג'ורדן]_0 ב שיא ממוצע ה נקודות ל קריירה ב - [NBA]_1 עם 30.12 נקודות ל משחק , ו ב שיא של 33.4 נקודות ב ממוצע ל משחק פלייאוף . ב שנת 2015 , לאור הישגי [ו]_0 יוצאי ה דופן ו תרומת [ו]_0 ל [ענף ה כדורסל]_1 , צורף [ג'ורדן]_0 ל היכל ה תהילה של פיב"א , ו ב - 2016 העניק ל [ו]_0 נשיא ארצות ה ברית ברק אובמה את מדליית ה חירות הנשיאותית , עיטור ה כבוד ה גבוה ביותר של ארצות ה ברית . [ 5 ] -NBA ה ראשון ש הפך ל בעלים של [קבוצה ב ליגה]_3 , ו כן ל בעלי [ה קבוצה]_3 ה יחידי ב ליגה מ מוצא אפרו - אמריקאי . במהלך שנותי [ו]_0 של [ג'ורדן]_0 כ בעלי [ה קבוצה]_3 , ההורנטס לא זכו ל הישגים מקצועיים משמעותיים , ו הגיעו רק שלוש פעמים ל מעמד ה פלייאוף . [ 6 ] אף על פי כן הצליח [ג'ורדן]_0 ב - 2023 למכור את רוב זכויות ה בעלות של [ו]_0 ב [קבוצה]_3 לפי שווי של כ שלושה מיליארד דולר . [ 7 ]

[רשת ה ריגול ה " אזרית "]_5 ש [ה שב"כ]_8 חשף [היום]_0 ( [שני]_0 ) מעידה ב בירור על חלוקת עבודה בין [משמרות ה מהפכה]_7 לשלוחים של [איראן]_1 . אות ם שלוחים מקבלים מ [איראן]_1 את ה אמצעים , את שיטות ה פעולה ו גם את ה ידע ש דרוש ל הרכב [ה]_1 ול תפעול מבצעי של מערכות נשק , ש [הם]_1 מפעילים במסגרת " טבעת ה אש " ש [ה איראנים]_1 הקימו כדי לפגוע ב [מדינת ישראל]_3 . אבל [השלוחים]_1 הם רק ה" ידיים " . את מטרות ה תקיפה , ו את ה מודיעין ה דרוש כדי להפוך כל מטרה ל - נ.צ מדויק , מספקים [ה איראנים]_1 . זה נכון בעיקר לגבי החות'ים ב תימן וה מיליציות ה שיעיות ב עיראק ו ב סוריה , אך נכון גם לגבי [חיזבאללה]_2 . אמנם [חיזבאללה]_2 מתוחכם יותר , ו יודע לאסוף מודיעין עבור [עצמו]_2 , אלא ש מדובר בעיקר ב מודיעין טקטי על [מטרות ב צפון [ישראל]_3]_4 , ש [חיזבאללה]_2 מנסה לפגוע ב [הן]_4 באמצעות רקטות , ו לפעמים על ידי מחבלים נושאי מטענים . את ה מטרות ה אסטרטגיות ל פגיעה , גם של [חיזבאללה]_2 ו גם של שאר השלוחים , בוחרת [טהרן]_1 כדי שישרת ו את ה יעדים ה אסטרטגיים של [[ה]_1 ב מלחמה ה עקיפה וה ישירה ש [היא]_1 מנהלת נגד [ישראל]_3 . [איראן]_1 רוצה לשלוט ב גובה ה להבות וב עוצמת ה נזק שהשלוחים של [ה]_1]_1 , ו [היא עצמה]_1 , מ סבים ל [ישראל]_3 . את כל זה אנחנו לומדים מ רשימת 600 " משימות ה איסוף " ש [איראן]_1 הטילה על [ה רשת האזרית]_5 . ה הנחיות ש ניתנו ל [אנשי [ה רשת]_5]_6 היו מדויקות ב אופן כ זה ש איסוף ה מודיעין יהיה תכליתי ו יאפשר לגרום נזק משמעותי . כמו שלוחים פחות מיומנים שגויסו על - ידי [איראן]_1 , גם [אנשי [ה רשת ה זאת]_5]_6 התבקשו לאסוף מודיעין על [אישים ישראלים]_9 . עובדה זו חושפת תסכול איראני עמוק מ ה עובדה ש עד היום לא הצליחה [טהרן]_1 לנקום על פי ה עיקרון ה תנ"כי " עין תחת עין " , על חיסולים של מדעני גרעין ו אנשי [משמרות ה מהפכה]_7 שחוסלו , לפי כלי תקשורת זרים , על ידי [ה מוסד ה ישראלי]_8 . ניסיון ה פגיעה ב שבת במ עונו של ראש ה ממשלה ב קיסריה מעיד ש מאמצי ה איסוף של [הם]_6 על [אישים ישראלים]_9 כבר הגיעו כבר ל שלב מתקדם , ו ש [הם]_6 יכולים ב שלב זה להזין את ה נתונים ה דרושים ל פגיעה מדויקת בראשי הנפץ של ה טילים או ה כטב"מים ש [הם]_10 או שלוחי [הם]_10 משגרים .

Interestingly, I see that sometimes pronouns are not detected as mentions (ones that I would consider easy?). I am not quite sure about the annotation guidelines or the reason for the bug but we can dig into that and perhaps train a better version if we find it feasible and useful.

@amir-zeldes
Owner

Thanks @maayanorner , that's interesting! From a practical perspective I see some challenges in plugging in this module with the current architecture. The output of the system for entities/coref is currently a set of categorized spans in coreference clusters, where every mention gets annotated (including singletons) and assigned an entity type. Here is a visualization of what the current output looks like:

[image: visualization of HebPipe's entity/coref output]

Colored boxes indicate coreference clusters, and gray ones are singletons. The icons in the corners correspond to the entity types (e.g. a person icon for PERSON). The system can also return conllu or XML, basically something like this:

<mention id="referent_2" head="2" cluster="2" etype="person">
עשרות
אנשים
</mention>
מגיעים
מ
<mention id="referent_3" head="5" cluster="3" etype="place">
תאילנד
</mention>
ל
<mention id="referent_4" head="7" cluster="4" etype="place">
ישראל
</mention>
כש
<mention id="referent_5" head="9" cluster="2" etype="person">
הם
</mention>
נרשמים
כ
<mention id="referent_6" head="12" cluster="6" etype="person">
מתנדבים
</mention>
,
אך
למעשה
משמשים
<mention id="referent_7" head="17" cluster="7" etype="person">
עובדים
שכירים
זולים
</mention>
.
<mention id="referent_8" head="21" cluster="8" etype="abstract">
תופעה
זו
</mention>
התבררה
אתמול
ב
<mention id="referent_9" head="26" cluster="9" etype="organization">
וועדת
<mention id="referent_10" head="28" cluster="10" etype="organization">
ה
עבודה
</mention>
ו
<mention id="referent_12" head="31" cluster="12" etype="abstract">
ה
רווחה
</mention>
של
<mention id="referent_13" head="34" cluster="13" etype="organization">
ה
כנסת
</mention>

I'm sure the coref component you are using is better than what the system is using right now, but structurally, it returns only a part of what the system is expecting. We could theoretically let the existing system keep outputting mentions and have your coref component make just the coreference decisions, but I'm guessing your system predicts coref and spans in one step, so it would not necessarily have coref predictions for all of the mention spans. Or in reverse, we could say we accept your system's coref judgments, but it's not guaranteed that those predicted clusters have categorized mention spans in the lower mention detection step. Does that make sense?
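
To make the mismatch concrete, here is a purely hypothetical sketch (none of these names are HebPipe's or the S2E model's actual API) of what stitching the two outputs together would look like, and where the gaps appear:

```python
# Hypothetical sketch of combining an external coref model's clusters (spans only)
# with HebPipe-style categorized mentions (span + entity type). The point is that
# spans predicted by only one side end up without a type (or without a cluster)
# and would need some fallback policy.
from typing import Dict, List, Optional, Tuple

Span = Tuple[int, int]  # (start_token, end_token)

def merge_clusters_with_types(
    clusters: List[List[Span]],       # output of the external coref model
    typed_mentions: Dict[Span, str],  # HebPipe mention spans -> entity type
) -> List[List[Tuple[Span, Optional[str]]]]:
    merged = []
    for cluster in clusters:
        # spans the mention detector never produced get etype=None
        merged.append([(span, typed_mentions.get(span)) for span in cluster])
    return merged

# toy example: the second span of cluster 0 (a pronoun) was never typed by the detector
clusters = [[(0, 1), (9, 9)], [(5, 5)]]
typed_mentions = {(0, 1): "person", (5, 5): "place", (12, 12): "organization"}
print(merge_clusters_with_types(clusters, typed_mentions))
# -> [[((0, 1), 'person'), ((9, 9), None)], [((5, 5), 'place')]]
# ...and (12, 12) is a typed mention (possibly a singleton) with no cluster at all.
```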

I think there are a couple of neural coref systems that can return singletons by now, and a recently graduated student of mine built one that also does entity type classification simultaneously in a multitask learning framework in this paper:

https://aclanthology.org/2023.ijcnlp-short.14/

But I don't know of Hebrew data like that that we could train it on... Very open to hearing ideas though, it would be nice to have more serious support for coref!

@maayanorner

maayanorner commented Oct 23, 2024


Hi Amir,

Yes, not supporting singletons and entity types is definitely a limitation. The output you demonstrate here looks great! Is that the output of the system or just a demo?

I can check whether the version of the dataset that still includes singletons (before their removal) is usable or too inconsistent.
What do you think? We can also follow up on that via email so I can CC the relevant people :) I also wonder if it is possible to "hack" singletons with NER and POS tagging? We have released a dataset for NER: https://github.com/IAHLT/hebrew_named_entities_open_dataset

You can email me with any questions regarding the data, perhaps I can find other versions internally.

Here is the link for the coref data (it does contain entity types; it is just this model that does not support them). I can try to train another one if you have a better tool - improving the results is exciting and very doable with another approach; my intuition is also that, given the size of the dataset, it could be useful to combine it with NER and POS tagging, but I am not sure:
https://github.com/IAHLT/coref

PS: I think these kinds of method combinations will be problematic and might complicate the system too much; it would be easier to just train a model that does what is needed :)

However, the formulation in S2E seems hackable in this regard, since there are actually a few scoring components, one of which detects mentions (f_m in the paper: https://aclanthology.org/2021.acl-short.3.pdf). I.e., perhaps it is possible to use only the antecedent scoring function.
