Incorrect plaintiff extraction in eyecite #193

quevon24 · 2025-01-25T01:01:32Z

When parsing the following text with eyecite:

text = 'Lee County School Dist. No. 1 v. Gardner, 263 F.Supp. 26 (SC 1967)'

The plaintiff is incorrectly extracted as 1 instead of Lee County School Dist. No. 1.

The issue seems to arise from the algorithm used to identify the plaintiff, which takes the two tokens immediately preceding the stop word "v.". However, the token list includes the whitespace between words as separate tokens, so those two tokens cover only one actual word plus a space. Any plaintiff name longer than one word is therefore truncated.

For example, in the problematic text above, the tokenized list of "words" is:

['Lee', ' ', 'County', ' ', 'School', ' ', 'Dist.', ' ', 'No.', ' ', '1', ' ', StopWordToken(data='v.', start=30, end=32, groups={'stop_word': 'v'}), ' ', 'Gardner,', ' ', CitationToken(data='263 F.Supp. 26', start=42, end=56, groups={'volume': '263', 'reporter': 'F.Supp.', 'page': '26'}, exact_editions=(), variation_editions=(Edition(reporter=Reporter(short_name='F. Supp.', name='Federal Supplement', cite_type='federal', source='reporters', is_scotus=False), short_name='F. Supp.', start=datetime.datetime(1932, 1, 1, 0, 0), end=datetime.datetime(1988, 12, 31, 0, 0)),), short=False), ' ', '(SC', ' ', '1967)']

Using the current algorithm, the two tokens before v. are ['1', ' '], which is incorrect.

This logic works fine for shorter plaintiff names, such as:

text = 'Smith v. Bar, 263 F.Supp. 26 (SC 1967)'

Here, the tokenized list is:

['Smith', ' ', StopWordToken(data='v.', start=6, end=8, groups={'stop_word': 'v'}), ' ', 'Bar,', ' ', CitationToken(data='263 F.Supp. 26', start=14, end=28, groups={'volume': '263', 'reporter': 'F.Supp.', 'page': '26'}, exact_editions=(), variation_editions=(Edition(reporter=Reporter(short_name='F. Supp.', name='Federal Supplement', cite_type='federal', source='reporters', is_scotus=False), short_name='F. Supp.', start=datetime.datetime(1932, 1, 1, 0, 0), end=datetime.datetime(1988, 12, 31, 0, 0)),), short=False), ' ', '(SC', ' ', '1967)']

The two tokens before v. are ['Smith', ' '], which correctly identifies the plaintiff as Smith.
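Both cases can be reproduced without eyecite. The following is a minimal sketch that uses plain strings as stand-ins for the tokens (the string 'v.' takes the place of the StopWordToken, and the helper name plaintiff_from_window is made up for illustration). It applies the same two-token slice to each list:

```python
def plaintiff_from_window(words, index):
    # Join the two tokens immediately before the stop word, then trim spaces.
    return "".join(str(w) for w in words[max(index - 2, 0):index]).strip()

# Plain-string stand-ins for the token lists shown above.
long_name = ['Lee', ' ', 'County', ' ', 'School', ' ', 'Dist.', ' ',
             'No.', ' ', '1', ' ', 'v.', ' ', 'Gardner,']
short_name = ['Smith', ' ', 'v.', ' ', 'Bar,']

print(plaintiff_from_window(long_name, long_name.index('v.')))    # 1
print(plaintiff_from_window(short_name, short_name.index('v.')))  # Smith
```

Because one of the two tokens is always the space before "v.", the window only ever holds a single word.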

The current approach:

if isinstance(word, StopWordToken):
    if word.groups["stop_word"] == "v" and index > 0:
        citation.metadata.plaintiff = "".join(
            str(w) for w in words[max(index - 2, 0) : index]
        ).strip()

is limited to selecting the last two tokens before "v.". This works for single-word names but fails for plaintiffs with longer names like Lee County School Dist. No. 1.

I'm guessing this was set to two elements before "v." because it is common for plaintiffs to have short names.
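One possible direction is to scan backwards from "v." and keep collecting tokens while they still look like part of a name. This is only a rough sketch, not a tested patch. The helper guess_plaintiff, the connectors set, and the max_words cap are all hypothetical, and the sketch assumes that anything in the list that is not a plain string (such as another citation token) marks a boundary:

```python
def guess_plaintiff(words, index, max_words=10):
    """Hypothetical helper: collect name-like words before words[index]."""
    # Lowercase words allowed inside a party name (an assumed, incomplete list).
    connectors = {"of", "the", "and", "&"}
    collected = []
    count = 0
    for w in reversed(words[:index]):
        if not isinstance(w, str):
            break  # a non-string token (e.g. a citation) ends the name
        if not w or w.isspace():
            collected.append(w)  # keep whitespace so the name joins cleanly
            continue
        name_like = w[0].isupper() or w[0].isdigit() or w.lower() in connectors
        if not name_like or count >= max_words:
            break
        collected.append(w)
        count += 1
    return "".join(reversed(collected)).strip()


words = ['held', ' ', 'in', ' ', 'Lee', ' ', 'County', ' ', 'School', ' ',
         'Dist.', ' ', 'No.', ' ', '1', ' ', 'v.', ' ', 'Gardner,']
print(guess_plaintiff(words, words.index('v.')))  # Lee County School Dist. No. 1
```

A heuristic like this will still be wrong in some cases. A capitalized word such as "See" directly before the name would be included, and a lowercase word inside a party name that is missing from connectors would cut the name short.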


mlissner · 2025-01-26T07:03:47Z

Yeah, this is one of those things that we just had to do a bad job on. I'm not sure how we could do much better. Do you have ideas for handling this?

quevon24 added this to Case Law Sprint Jan 25, 2025

quevon24 moved this to General Backlog in Case Law Sprint Jan 25, 2025
