Getting full citation span #135

Open
overmode opened this issue Jan 12, 2023 · 15 comments
Assignees
Labels
bug Something isn't working enhancement New feature or request

Comments

@overmode
Contributor

Hi, thank you for the great library !

Problem description

I am preparing a dataset, in which I would like to mask some citations, e.g. replacing them by "[CITATION]".
I could not find a way to get the full span of the citation. Indeed, only the normalized part is covered by the builtin span() function (see below)

import eyecite

citations = [
    'Commonwealth v. Gibson, 561 A.2d 1240 1242',
    'Commonwealth v. Bauer, 604 A.2d 1098 (Pa.Super. 1992)',
]

for citation in citations:
    print('\n', '='*20)
    extracted_citation = eyecite.get_citations(citation)[0]
    print(extracted_citation)

    # span() covers only the volume/reporter/page part of the citation
    start_idx, end_idx = extracted_citation.span()

    before_cit = citation[:start_idx]
    cit_text = citation[start_idx:end_idx]
    after_cit = citation[end_idx:]
    print(f"{before_cit} [BEGIN] {cit_text} [END] {after_cit}")

output :

====================
FullCaseCitation('561 A.2d 1240', groups={'volume': '561', 'reporter': 'A.2d', 'page': '1240'}, metadata=FullCaseCitation.Metadata(parenthetical=None, pin_cite='1242', year=None, court=None, plaintiff='Commonwealth', defendant='Gibson', extra=None))
Commonwealth v. Gibson,  [BEGIN] 561 A.2d 1240 [END]  1242

 ====================
FullCaseCitation('604 A.2d 1098', groups={'volume': '604', 'reporter': 'A.2d', 'page': '1098'}, metadata=FullCaseCitation.Metadata(parenthetical=None, pin_cite=None, year='1992', court=None, plaintiff='Commonwealth', defendant='Bauer', extra=None))
Commonwealth v. Bauer,  [BEGIN] 604 A.2d 1098 [END]  (Pa.Super. 1992)

One can see that the span only partially covers the citation text.
If possible, I would like to avoid using regex for recovering the full span.
Concatenating the lengths of the citation's attributes (plaintiff, defendant, etc.) does not seem to be a viable solution either, because in the second example the "Pa.Super." text is not captured by any attribute.
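
For the masking step itself, here is a minimal sketch, assuming the character spans are already known (for instance from span(), as in the snippet above). The helper name mask_spans and the "[CITATION]" token are illustrative and not part of eyecite; the sketch also assumes the spans do not overlap.

```python
from typing import Iterable, Tuple


def mask_spans(
    text: str,
    spans: Iterable[Tuple[int, int]],
    token: str = "[CITATION]",
) -> str:
    """Replace each (start, end) character span in `text` with `token`.

    Hypothetical helper: assumes the spans do not overlap.
    """
    # Work from right to left so earlier offsets stay valid after each replacement.
    for start, end in sorted(spans, reverse=True):
        text = text[:start] + token + text[end:]
    return text


text = "Commonwealth v. Gibson, 561 A.2d 1240 1242"
start = text.index("561 A.2d 1240")
span = (start, start + len("561 A.2d 1240"))  # stand-in for extracted_citation.span()
masked = mask_spans(text, [span])
print(masked)  # Commonwealth v. Gibson, [CITATION] 1242
```

With only the normalized span available, the case name and pin cite are left behind, which is exactly the problem described above.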

Desired behavior

It would be nice to have a 'full_span()' function such that, if I use it instead of span() in the above example, I get

====================
FullCaseCitation('561 A.2d 1240', groups={'volume': '561', 'reporter': 'A.2d', 'page': '1240'}, metadata=FullCaseCitation.Metadata(parenthetical=None, pin_cite='1242', year=None, court=None, plaintiff='Commonwealth', defendant='Gibson', extra=None))
 [BEGIN]Commonwealth v. Gibson, 561 A.2d 1240 1242[END]

 ====================
FullCaseCitation('604 A.2d 1098', groups={'volume': '604', 'reporter': 'A.2d', 'page': '1098'}, metadata=FullCaseCitation.Metadata(parenthetical=None, pin_cite=None, year='1992', court=None, plaintiff='Commonwealth', defendant='Bauer', extra=None))
[BEGIN]Commonwealth v. Bauer,  604 A.2d 1098 (Pa.Super. 1992)[END] 

Specs

eyecite version : 2.4.0

@flooie
Contributor

flooie commented Jan 12, 2023

Hey @overmode

Thanks for the write-up. There is a method on FullCaseCitation called corrected_citation_full().

It returns the full normalized string.

from eyecite import get_citations

citations = [
    'the asdf asdf the asdfa sd Commonwealth v. Gibson, 561 A.2d 1240 1242 asdf asdf asdf ',
    'Commonwealth v. Bauer, 604 A.2d 1098 (Pa. Super. 1992)',
]
for text in citations:
    # Take the first citation found in each string and print its full form
    print(get_citations(text)[0].corrected_citation_full())

When you run it, it returns the full citation including the party names, but I believe there is a bug in how it handles dates and courts.

If you wanted to take a look at eyecite.models.FullCaseCitation.corrected_citation_full and fix the bug related to date and court, it would return something like

Commonwealth v. Gibson, 561 A.2d 1240
Commonwealth v. Bauer, 604 A.2d 1098 (pasuperct 1992)

for the example above.

@flooie flooie added bug Something isn't working enhancement New feature or request labels Jan 12, 2023
@mlissner
Member

Is the idea, @overmode, to remove all citations to make it better training data?

@mlissner
Member

One other thing to know, @overmode, is that the way we identify the name of the case is very sloppy. It just uses heuristics around where it finds a "v.", if it finds one, and otherwise just grabs the average length of a case name, I think. It's hardcoded at around 30 tokens, IIRC.

@overmode
Contributor Author

overmode commented Jan 13, 2023

Hey, thanks for the quick reply.
@mlissner Indeed, the idea is to build a training set for some machine learning application.

I took note of your point. It's OK if the recall of citation extraction is not excellent, because I have many documents anyway, but I will need a way to tell whether the parsing went well so that precision, at least, is good.

@flooie I tried the eyecite.models.FullCaseCitation.corrected_citation_full method, it does break at the second example :

====================
EXTRACTED : FullCaseCitation('561 A.2d 1240', groups={'volume': '561', 'reporter': 'A.2d', 'page': '1240'}, metadata=FullCaseCitation.Metadata(parenthetical=None, pin_cite='1242', year=None, court=None, plaintiff='Commonwealth', defendant='Gibson', extra=None))
CORRECTED_CITATION_FULL : Commonwealth v. Gibson, 561 A.2d 1240, 1242
CITATION SPAN : Commonwealth v. Gibson,  [BEGIN] 561 A.2d 1240 [END]  1242

 ====================
EXTRACTED : FullCaseCitation('604 A.2d 1098', groups={'volume': '604', 'reporter': 'A.2d', 'page': '1098'}, metadata=FullCaseCitation.Metadata(parenthetical=None, pin_cite=None, year='1992', court=None, plaintiff='Commonwealth', defendant='Bauer', extra=None))
Error executing job with overrides: []
Traceback (most recent call last):
  File "check_samples.py", line 59, in main
    print('CORRECTED_CITATION_FULL :', extracted_citation.corrected_citation_full())
  File "/home/ubuntu/.local/lib/python3.8/site-packages/eyecite/models.py", line 361, in corrected_citation_full
    publisher_date = " ".join(m[i] for i in (m.court, m.year) if i)
  File "/home/ubuntu/.local/lib/python3.8/site-packages/eyecite/models.py", line 361, in <genexpr>
    publisher_date = " ".join(m[i] for i in (m.court, m.year) if i)
TypeError: 'Metadata' object is not subscriptable

This is not exactly what I would like, though, because it is not the exact text that was matched (notice the added comma between the page numbers).
Is there a better way to recover the matched text?

Also, is this the bug you pointed out? I'm open to submitting a PR if there is no better workaround, so I would appreciate any insights you can share.

[UPDATE]
I fixed the bug by replacing that line with publisher_date = " ".join(i for i in (m.court, m.year) if i)
The extracted full citation for the second example becomes
Commonwealth v. Bauer, 604 A.2d 1098 (1992

The parenthesis is not closed because in eyecite.models, line 362, we have

if publisher_date:
    parts.append(f" ({publisher_date}")

I assume that a parenthesis is missing at the end.
Does it make sense for the Pa. Super. not to be included here ?
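
To make the two fixes concrete, here is a minimal standalone sketch of the court/year formatting step. The function name format_court_year is hypothetical and this is not the code in eyecite.models; it only mirrors the join and the parenthesis discussed above.

```python
from typing import Optional


def format_court_year(court: Optional[str], year: Optional[str]) -> str:
    """Build the trailing ' (court year)' part of a citation.

    Hypothetical helper; returns '' when both values are missing.
    """
    # Join the values themselves. Indexing the metadata object with them
    # (m[i]) is what raised the TypeError in the traceback above.
    publisher_date = " ".join(value for value in (court, year) if value)
    # Close the parenthesis that the original f-string left open.
    return f" ({publisher_date})" if publisher_date else ""


print("Commonwealth v. Bauer, 604 A.2d 1098" + format_court_year(None, "1992"))
# Commonwealth v. Bauer, 604 A.2d 1098 (1992)
```

The court is still missing from the output because the extracted metadata has court=None for this example; the formatting step can only print what was extracted.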

@mattdahl
Contributor

Just chiming in here since I saw your PR (#136) and was surprised that this wasn't already possible! Thanks for implementing it!

Separate from your changes in the PR, I was also curious about the court issue. It seems that "Pa.Super." is not being extracted properly because the citation_string listed for the PA Superior Court is "Pa. Super. Ct." (line 46902 here: https://github.com/freelawproject/courts-db/blob/main/courts_db/data/courts.json). The problem is the space between "Pa." and "Super.". This also seems like something that should be fixed: would it cause problems to just ignore whitespace when matching court abbreviations here: https://github.com/freelawproject/eyecite/blob/main/eyecite/helpers.py#L52? This may be related to the changes proposed in #129.

@overmode
Contributor Author

overmode commented Jan 18, 2023

@mattdahl Thanks !
Maybe we should consider moving away from exact string matching and use a simple regex instead?
For instance, with r'\s*pa\s*super\s*' we would not depend on the spacing, and we could also make it robust to punctuation.
I don't think it would hurt speed much.

@flooie
Contributor

flooie commented Jan 18, 2023

@overmode every time I see the words simple and regex I get nervous. I'm not sure I see how this relatively simple situation is resolved with regex.

@overmode
Contributor Author

I understand, regexes are powerful but scale badly.
Well, the equivalent in python here would be to remove all punctuation and spaces, and then look for 'pasuper'.
I think the question was more whether it would not work in some corner cases, and you are much more knowledgeable than I am.
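
A minimal sketch of that plain-Python equivalent, using only the standard library. The function name normalize_court_text is made up for illustration, and lowercasing is an extra assumption so that the result can be compared with 'pasuper'.

```python
import string

# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def normalize_court_text(text: str) -> str:
    """Drop punctuation and all whitespace, then lowercase (hypothetical helper)."""
    no_punct = text.translate(_PUNCT_TABLE)
    return "".join(no_punct.split()).lower()


for raw in ("Pa.Super.", "Pa. Super.", "Pa. Super. Ct."):
    key = normalize_court_text(raw)
    print(f"{raw!r} -> {key!r}, starts with 'pasuper': {key.startswith('pasuper')}")
```

All three spellings share the 'pasuper' prefix after normalization, so the spacing in the source text no longer matters.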

@mlissner
Member

For the court issue, the question is essentially, "What bad things will happen if we broaden how we match court strings against the text?"

Honestly, I don't think anybody knows. Right now we do two things. We:

  • Strip the punctuation with string_puc, and we
  • Use startswith to strip terminal periods, which sometimes seem to interfere

If we went a step further and matched with regexes or by taking out whitespace, would we have false matches? I don't know, but I know how to check!

If we want to run this down, I think the trick is to look at the citation_string values for every court in the courts DB and see what happens if you strip out spaces in addition to stripping out punctuation. I think it might be fine, but what we'd want to watch out for are two courts with nearly identical citation strings that overlap due to this. If that analysis turns up no new collisions, I'd say yeah, let's add a third step to how we normalize and compare citation strings.
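
The check described here could be sketched roughly as follows. This is an illustration only: the court ids and the sample mapping are made up rather than loaded from the courts DB, and the helper names are hypothetical.

```python
import string
from collections import defaultdict
from typing import Callable, Dict, List

_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def strip_punctuation(text: str) -> str:
    """Current-style normalization: remove punctuation only."""
    return text.translate(_PUNCT_TABLE)


def strip_punctuation_and_spaces(text: str) -> str:
    """Proposed normalization: remove punctuation and all whitespace."""
    return "".join(strip_punctuation(text).split())


def find_collisions(
    citation_strings: Dict[str, str], normalize: Callable[[str], str]
) -> Dict[str, List[str]]:
    """Map each normalized string shared by more than one court to those court ids."""
    groups = defaultdict(list)
    for court_id, citation_string in citation_strings.items():
        groups[normalize(citation_string)].append(court_id)
    return {key: sorted(ids) for key, ids in groups.items() if len(ids) > 1}


def new_collisions(citation_strings: Dict[str, str]) -> Dict[str, List[str]]:
    """Collision groups that appear only once whitespace is also removed."""
    before = {
        tuple(ids)
        for ids in find_collisions(citation_strings, strip_punctuation).values()
    }
    after = find_collisions(citation_strings, strip_punctuation_and_spaces)
    return {key: ids for key, ids in after.items() if tuple(ids) not in before}


# Made-up sample: court id -> citation_string.
sample = {
    "court_1": "N.Y. Cty. Ct.",
    "court_2": "N.Y. Cty. Ct.",
    "court_3": "Pa. Super. Ct.",
}
print(find_collisions(sample, strip_punctuation))  # existing collision
print(new_collisions(sample))  # empty: removing spaces adds nothing new here
```

An empty result from new_collisions on the real data would support adding the whitespace step; any non-empty group would need a manual look.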

@mattdahl
Contributor

Here's a gist doing that collision test: https://gist.github.com/mattdahl/a563a48ac512275d893907dd19acd4ae

It doesn't seem that removing whitespace causes any additional collisions, so I think we can safely do that. However, the fact that there are so many existing collisions also suggests that we probably shouldn't just be uncritically accepting the first match, as currently implemented.
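
One way to avoid accepting the first match uncritically is to collect every candidate and let the caller decide. The sketch below is an assumption about how that might look, not eyecite's implementation; the court ids, the sample mapping, and the function names are hypothetical.

```python
import string
from typing import Dict, List

_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def normalize(text: str) -> str:
    """Remove punctuation and whitespace, then lowercase."""
    return "".join(text.translate(_PUNCT_TABLE).split()).lower()


def candidate_courts(paren_text: str, citation_strings: Dict[str, str]) -> List[str]:
    """Return every court id whose normalized citation string prefixes the text."""
    key = normalize(paren_text)
    if not key:
        return []
    matches = []
    for court_id, citation_string in citation_strings.items():
        prefix = normalize(citation_string)
        # Skip empty prefixes: str.startswith("") is always True.
        if prefix and key.startswith(prefix):
            matches.append(court_id)
    return sorted(matches)


# Made-up sample: court id -> citation_string.
sample = {
    "court_pa": "Pa.",
    "court_pa_super": "Pa. Super. Ct.",
    "court_ny": "N.Y.",
}
print(candidate_courts("Pa. Super. Ct. 1992", sample))
# ['court_pa', 'court_pa_super']
```

Here two courts match the same parenthetical, so a caller could treat anything other than exactly one candidate as unresolved, or prefer the longest matching citation string.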

@mlissner
Member

However, the fact that there are so many existing collisions also suggests that we probably shouldn't just be uncritically accepting the first match, as currently implemented.

Yeah, that jumped out at me too. @flooie what's your take on that?

@flooie
Contributor

flooie commented Jan 25, 2023

Here's a gist doing that collision test: https://gist.github.com/mattdahl/a563a48ac512275d893907dd19acd4ae

It doesn't seem that removing whitespace causes any additional collisions, so I think we can safely do that. However, the fact that there are so many existing collisions also suggests that we probably shouldn't just be uncritically accepting the first match, as currently implemented.

[Image: Screenshot 2023-01-25 at 2 42 41 PM]

@flooie
Contributor

flooie commented Jan 25, 2023

@mattdahl - we had imported a lot of low-level county and town courts, and in NY a number of them had been generated with the parent court's citation string.

For example, New York County Court has something like 50+ county courts, and they were generated with N.Y. Cty. Ct. as the citation string instead of NY Cty. Ct., Suffolk Cty. ... etc. I went through and fixed the 100 or so collisions.

@mattdahl
Contributor

Nice!! The only duplicate left is N.Y. Cty. Ct., Nassau Cty. -- is that intentional?

@flooie
Contributor

flooie commented Jan 25, 2023

No, ha, that's just a duplicate court. I'll strip that in a second. I have a few things to add about courts and citation strings. I'll add them momentarily.

Projects
Status: 🆕 New
Status: General Backlog
Status: No status
Development

No branches or pull requests

4 participants