When parsing the following text with eyecite:
text = 'Lee County School Dist. No. 1 v. Gardner, 263 F.Supp. 26 (SC 1967)'
The plaintiff is incorrectly extracted as 1 instead of Lee County School Dist. No. 1.
The issue seems to arise from the algorithm used to identify the plaintiff, which extracts the two "words" immediately preceding the stop word "v.". However, the current implementation counts spaces as separate "words", which leads to incorrect results for plaintiffs with longer names.
For example, in the problematic text above, the tokenized list of "words" is:
['Lee', ' ', 'County', ' ', 'School', ' ', 'Dist.', ' ', 'No.', ' ', '1', ' ', StopWordToken(data='v.', start=30, end=32, groups={'stop_word': 'v'}), ' ', 'Gardner,', ' ', CitationToken(data='263 F.Supp. 26', start=42, end=56, groups={'volume': '263', 'reporter': 'F.Supp.', 'page': '26'}, exact_editions=(), variation_editions=(Edition(reporter=Reporter(short_name='F. Supp.', name='Federal Supplement', cite_type='federal', source='reporters', is_scotus=False), short_name='F. Supp.', start=datetime.datetime(1932, 1, 1, 0, 0), end=datetime.datetime(1988, 12, 31, 0, 0)),), short=False), ' ', '(SC', ' ', '1967)']
Using the current algorithm, the two tokens before v. are ['1', ' '], which is incorrect.
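The effect of the fixed two-token slice can be reproduced without eyecite by standing in plain strings for the word tokens (the sentinel STOP below is illustrative only, playing the role of the StopWordToken):

```python
# Simplified stand-in for eyecite's token stream: plain strings for word
# tokens and spaces, a sentinel object for the StopWordToken.
STOP = object()

words = ['Lee', ' ', 'County', ' ', 'School', ' ', 'Dist.', ' ',
         'No.', ' ', '1', ' ', STOP, ' ', 'Gardner,']
index = words.index(STOP)

# The current two-token slice: words[max(index - 2, 0):index]
plaintiff = "".join(str(w) for w in words[max(index - 2, 0):index]).strip()
print(plaintiff)  # prints "1", not "Lee County School Dist. No. 1"
```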
This logic works fine for shorter plaintiff names, such as:
text = 'Smith v. Bar, 263 F.Supp. 26 (SC 1967)'
Here, the tokenized list is:
['Smith', ' ', StopWordToken(data='v.', start=6, end=8, groups={'stop_word': 'v'}), ' ', 'Bar,', ' ', CitationToken(data='263 F.Supp. 26', start=14, end=28, groups={'volume': '263', 'reporter': 'F.Supp.', 'page': '26'}, exact_editions=(), variation_editions=(Edition(reporter=Reporter(short_name='F. Supp.', name='Federal Supplement', cite_type='federal', source='reporters', is_scotus=False), short_name='F. Supp.', start=datetime.datetime(1932, 1, 1, 0, 0), end=datetime.datetime(1988, 12, 31, 0, 0)),), short=False), ' ', '(SC', ' ', '1967)']
The two tokens before v. are ['Smith', ' '], which correctly identifies the plaintiff as Smith.
The current approach:

if isinstance(word, StopWordToken):
    if word.groups["stop_word"] == "v" and index > 0:
        citation.metadata.plaintiff = "".join(
            str(w) for w in words[max(index - 2, 0) : index]
        ).strip()

is limited to selecting the last two tokens before "v.". This works for short names but fails for plaintiffs with longer names like Lee County School Dist. No. 1.
I'm guessing this was set to two tokens before "v." because it is common for plaintiffs to have short names.
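One possible direction (a sketch only, not eyecite's actual implementation) is to replace the fixed two-token slice with a backward scan that collects consecutive plain-string tokens until it hits a non-string token such as a CitationToken. The extract_plaintiff helper and the STOP sentinel below are hypothetical names introduced for illustration:

```python
def extract_plaintiff(words, index):
    """Collect plain-string tokens backwards from the stop word at `index`.

    Sketch only: eyecite's real token types and boundary rules differ; this
    just illustrates scanning backwards instead of slicing two tokens.
    """
    parts = []
    for w in reversed(words[:index]):
        if not isinstance(w, str):  # stop at CitationToken / other token objects
            break
        parts.append(w)
    return "".join(reversed(parts)).strip()

STOP = object()  # stand-in for the StopWordToken
words = ['Lee', ' ', 'County', ' ', 'School', ' ', 'Dist.', ' ',
         'No.', ' ', '1', ' ', STOP, ' ', 'Gardner,']
print(extract_plaintiff(words, words.index(STOP)))
# prints "Lee County School Dist. No. 1"
```

In real prose this scan would need additional boundary heuristics (sentence punctuation, capitalization), since otherwise ordinary words preceding the case name would be swallowed into the plaintiff.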