Incorrect plaintiff extraction in eyecite #193

Open
quevon24 opened this issue Jan 25, 2025 · 1 comment

Comments

@quevon24

When parsing the following text with eyecite:

text = 'Lee County School Dist. No. 1 v. Gardner, 263 F.Supp. 26 (SC 1967)'

The plaintiff is incorrectly extracted as 1 instead of Lee County School Dist. No. 1.

The issue seems to arise from the algorithm used to identify the plaintiff, which takes the two "words" immediately preceding the v. stop word. However, the current implementation appears to count each space as a separate "word," so the two-token window covers only one real word plus its trailing space, and longer plaintiff names are truncated.

For example, in the problematic text above, the tokenized list of "words" is:

['Lee', ' ', 'County', ' ', 'School', ' ', 'Dist.', ' ', 'No.', ' ', '1', ' ', StopWordToken(data='v.', start=30, end=32, groups={'stop_word': 'v'}), ' ', 'Gardner,', ' ', CitationToken(data='263 F.Supp. 26', start=42, end=56, groups={'volume': '263', 'reporter': 'F.Supp.', 'page': '26'}, exact_editions=(), variation_editions=(Edition(reporter=Reporter(short_name='F. Supp.', name='Federal Supplement', cite_type='federal', source='reporters', is_scotus=False), short_name='F. Supp.', start=datetime.datetime(1932, 1, 1, 0, 0), end=datetime.datetime(1988, 12, 31, 0, 0)),), short=False), ' ', '(SC', ' ', '1967)']

Using the current algorithm, the two tokens before v. are ['1', ' '], which is incorrect.
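
The behavior can be reproduced without eyecite. The following is a minimal sketch, assuming a simplified token list: `StopWordToken` here is a hypothetical stand-in dataclass, not eyecite's real class, and `two_token_window` is an illustrative helper that only mimics the slice used for plaintiff extraction.

```python
from dataclasses import dataclass


@dataclass
class StopWordToken:
    """Hypothetical stand-in for eyecite's StopWordToken (illustration only)."""

    data: str

    def __str__(self) -> str:
        return self.data


def two_token_window(words, index):
    """Return the two tokens immediately before words[index]."""
    return words[max(index - 2, 0):index]


# Simplified version of the token list above: spaces are separate tokens.
words = [
    "Lee", " ", "County", " ", "School", " ", "Dist.", " ", "No.", " ",
    "1", " ", StopWordToken("v."), " ", "Gardner,",
]

# Locate the stop word, then apply the two-token window.
index = next(i for i, w in enumerate(words) if isinstance(w, StopWordToken))
window = two_token_window(words, index)
plaintiff = "".join(str(w) for w in window).strip()

print(window)     # ['1', ' ']
print(plaintiff)  # 1
```

Because one of the two slots is always taken by the space before v., the window can never hold more than one real word.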

This logic works fine for shorter plaintiff names, such as:

text = 'Smith v. Bar, 263 F.Supp. 26 (SC 1967)'

Here, the tokenized list is:

['Smith', ' ', StopWordToken(data='v.', start=6, end=8, groups={'stop_word': 'v'}), ' ', 'Bar,', ' ', CitationToken(data='263 F.Supp. 26', start=14, end=28, groups={'volume': '263', 'reporter': 'F.Supp.', 'page': '26'}, exact_editions=(), variation_editions=(Edition(reporter=Reporter(short_name='F. Supp.', name='Federal Supplement', cite_type='federal', source='reporters', is_scotus=False), short_name='F. Supp.', start=datetime.datetime(1932, 1, 1, 0, 0), end=datetime.datetime(1988, 12, 31, 0, 0)),), short=False), ' ', '(SC', ' ', '1967)']

The two tokens before v. are ['Smith', ' '], which correctly identifies the plaintiff as Smith.

The current approach:

if isinstance(word, StopWordToken):
    if word.groups["stop_word"] == "v" and index > 0:
        citation.metadata.plaintiff = "".join(
            str(w) for w in words[max(index - 2, 0) : index]
        ).strip()

is limited to selecting the last two tokens before the v. stop word. This works for one-word names but fails for plaintiffs with longer names like Lee County School Dist. No. 1.

I'm guessing the window was set to two elements before v. because it is common for plaintiffs to have short names.
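
One possible direction, offered only as a rough sketch and not as eyecite's method: scan backward from the stop word, count only non-whitespace tokens, and stop at the first non-string token. Everything here is an assumption made for illustration: `StopWordToken` is a hypothetical stand-in, `plaintiff_before` is an invented helper, and the `max_words=8` cap is arbitrary.

```python
from dataclasses import dataclass


@dataclass
class StopWordToken:
    """Hypothetical stand-in for eyecite's StopWordToken (illustration only)."""

    data: str


def plaintiff_before(words, index, max_words=8):
    """Join up to max_words non-whitespace string tokens before words[index].

    Whitespace tokens are kept for joining but not counted. The backward
    scan stops at the first non-string token (for example another stop
    word or a citation token) or at the start of the list.
    """
    collected = []
    count = 0
    for token in reversed(words[:index]):
        if not isinstance(token, str):
            break
        if token.strip():  # a real word, not just whitespace
            if count == max_words:
                break
            count += 1
        collected.append(token)
    return "".join(reversed(collected)).strip()


long_name = [
    "Lee", " ", "County", " ", "School", " ", "Dist.", " ", "No.", " ",
    "1", " ", StopWordToken("v."), " ", "Gardner,",
]
short_name = ["Smith", " ", StopWordToken("v."), " ", "Bar,"]

print(plaintiff_before(long_name, 12))               # Lee County School Dist. No. 1
print(plaintiff_before(short_name, 2))               # Smith
print(plaintiff_before(long_name, 12, max_words=2))  # No. 1
```

The trade-off is that a wider window will also pick up unrelated words that precede the case name in running text (for example a leading "See"), so some boundary heuristic would still be needed.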

@quevon24 quevon24 moved this to General Backlog in Case Law Sprint Jan 25, 2025
@mlissner

Yeah, this is one of those things that we just had to do a bad job on. I'm not sure how we could do much better. Do you have ideas for handling this?
