Missing entities on data preparation with conll03_to_json.py #8

Open
roebbert92 opened this issue Apr 3, 2023 · 0 comments

Dear Tianyu Liu,

I really like the paper and the idea! And also thank you for releasing the code base!
I am currently working on my master's thesis and I am planning to augment this architecture with knowledge infusion.

While doing so, I ran into an issue with the script that converts the CoNLL03 dataset to the required JSON structure.
The tables below show that your code (denoted eth_asp) misses 27 entities in total across the train, dev, and test sets.
[Image: conll03]

Your code does not check whether an entity is still open when a document ends, so entities at the end of a document are never added to the output.
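
To make the failure mode concrete, here is a minimal, self-contained sketch. It is not the code from conll03_to_json.py: `bio_to_spans` and its `flush_at_end` flag are hypothetical names used only for illustration, and it assumes BIO tags and half-open [start, end) spans.

```python
from typing import Dict, List, Optional


def bio_to_spans(tags: List[str], flush_at_end: bool = True) -> List[Dict]:
    """Convert the BIO tags of one document into half-open [start, end) spans."""
    spans: List[Dict] = []
    start: Optional[int] = None
    current_type: Optional[str] = None
    for idx, tag in enumerate(tags):
        if tag.startswith("I-"):
            continue  # still inside the currently open entity
        # "O" or "B-": an open entity ends right before this token
        if start is not None:
            spans.append({"type": current_type, "start": start, "end": idx})
            start = None
        if tag.startswith("B-"):
            start, current_type = idx, tag[2:]
    # Without this flush, an entity that runs up to the last token is lost.
    if flush_at_end and start is not None:
        spans.append({"type": current_type, "start": start, "end": len(tags)})
    return spans


# Toy document whose last token is an entity.
tags = ["O", "B-PER", "I-PER", "O", "B-LOC"]
print(bio_to_spans(tags, flush_at_end=False))  # the trailing LOC entity is missing
print(bio_to_spans(tags, flush_at_end=True))   # the trailing LOC entity is kept
```

In this sketch the trailing span ends at `len(tags)`, that is, one past the index of the last token, which is what a half-open [start, end) span requires.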

I propose the following changes to your code:

            if line == "-DOCSTART- -X- -X- O":  # new doc
                if doc is not None:
                    # when extended is not the same as tokens
                    # mark where to copy from with <extra_id_22> and <extra_id_23>
                    # E.g.
                    # Extract entities such as apple, orange, lemon <extra_id_22> Give me a mango . <extra_id_23>
                    # See ace05_to_json.py for example of extending the input

                    # FIX: missing entities <<<<<<<<<<<<<<<<<<<<
                    if start is not None:
                        doc['entities'].append({
                            "type": current_type,
                            "start": start,
                            "end": idx if idx > start else idx + 1
                        })
                    # FIX: missing entities >>>>>>>>>>>>>>>>>>>>

                    doc["extended"] = doc["tokens"]
                    dataset.append(doc)
                doc = {
                    "tokens": [],  # list of tokens for the model to copy from
                    "extended": [],  # list of input tokens. Prompts, instructions, etc. go here
                    "entities": []  # list of dict:{"type": type, "start": start, "end": end}, format: [start, end)
                }
                idx, start = -1, None
                continue
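
One rough way to check whether any entities are still being dropped is to compare the number of B- tags in the raw file with the number of entities in the converted output. The sketch below assumes BIO tagging in which every entity starts with a B- tag (with IOB1-style tags the counts would not line up), and the helper names `count_b_tags` and `count_converted_entities` are hypothetical, not part of the repository.

```python
from typing import Dict, List


def count_b_tags(conll_lines: List[str]) -> int:
    """Count lines whose last column is a B- tag (one per entity under BIO)."""
    count = 0
    for line in conll_lines:
        items = line.split()
        if items and items[-1].startswith("B-"):
            count += 1
    return count


def count_converted_entities(dataset: List[Dict]) -> int:
    """Count the entities across all converted documents."""
    return sum(len(doc["entities"]) for doc in dataset)


# Toy input: one document that ends with an entity.
raw_lines = [
    "-DOCSTART- -X- -X- O",
    "",
    "Give O O O",
    "Robin NNP B-NP B-PER",
    "Zurich NNP B-NP B-LOC",
]
# Toy converted output in which the trailing entity was dropped.
converted = [{
    "tokens": ["Give", "Robin", "Zurich"],
    "entities": [{"type": "PER", "start": 1, "end": 2}],
}]

missing = count_b_tags(raw_lines) - count_converted_entities(converted)
print(f"missing entities: {missing}")
```

A non-zero difference on the real files would indicate that some entities are still not captured.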

Best regards,
Robin
