authors: catastrophic backtracking in regex #26

Open
jacquerie opened this issue Apr 12, 2017 · 3 comments

jacquerie commented Apr 12, 2017

How to reproduce:

>>> from refextract import extract_references_from_string
>>> extract_references_from_string('G. W. and L. B. and M. M. G. and T. A. and E. L. I. and E. P. and X. M. and B. Urbaszek, Magneto-optics in transition metal diselenide monolayers. 2D Mater. 2, 34002 (2015).')

This hangs refextract for days, at least.

The reason appears to be catastrophic backtracking in this regex:

re_weaker_author = ur"""
## look closely for initials, and less closely at the last name.
(?:([A-Z]((\.\s?)|(\.?\s+)|(\-))){1,5}
(?:[^\s_<>0-9]+(?:(?:[,\.]\s*)|(?:[,\.]?\s+)))+)"""
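To make the backtracking concrete, here is a minimal Python 3 sketch that isolates only the second line of the regex above (the repeated "word plus separator" group). Each token such as `A. ` can be consumed in three ways: the word class takes the dot and the second separator alternative takes the space, or the word stops before the dot and either separator alternative takes `. `. When something later in the pattern fails, the engine retries every combination. The trailing `$`, the synthetic input, and the `UNAMBIGUOUS` pattern are assumptions made for illustration. `UNAMBIGUOUS` is a hypothetical rewrite, not the fix adopted in refextract, and it is not strictly equivalent to the original (for example, it no longer accepts `a..b` as a single word).

```python
import re
import time

# Second line of re_weaker_author, isolated. The word class admits "," and ".",
# and both separator alternatives can consume the same ". " sequence.
AMBIGUOUS = r"(?:[^\s_<>0-9]+(?:(?:[,\.]\s*)|(?:[,\.]?\s+)))+"

# Hypothetical rewrite (an assumption, not the project's fix): punctuation is
# excluded from the word class and the separator alternatives start with
# disjoint characters, so each token can be consumed in only one way.
UNAMBIGUOUS = r"(?:[^\s_<>0-9,\.]+(?:[,\.]\s*|\s+))+"


def failing_input(n_tokens):
    """Build n initial-like tokens followed by a digit that no word can match."""
    return "A. " * n_tokens + "1"


def time_failed_match(pattern, n_tokens):
    """Time a match attempt that must fail, forcing the engine to backtrack."""
    regex = re.compile(pattern + r"$")
    start = time.perf_counter()
    result = regex.match(failing_input(n_tokens))
    elapsed = time.perf_counter() - start
    assert result is None  # the trailing digit can never be consumed
    return elapsed


if __name__ == "__main__":
    # Keep n small: the ambiguous pattern's running time grows roughly
    # threefold with every extra token.
    for n in (4, 6, 8, 10):
        slow = time_failed_match(AMBIGUOUS, n)
        fast = time_failed_match(UNAMBIGUOUS, n)
        print(f"{n:2d} tokens: ambiguous {slow:.6f}s, unambiguous {fast:.6f}s")
```

On the reference string from this report, both fragments match the same prefix (up to and including `monolayers. `), so the difference only shows up when the surrounding pattern forces a retry.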

@david-caro

This is the article that causes the issue; it should be reharvested once this is fixed: arXiv:1704.00841

kaplun added a commit to kaplun/inspire-next that referenced this issue Jun 15, 2017
Works around inspirehep/refextract#26 by interrupting the runaway
refextract process.

Signed-off-by: Samuele Kaplun <[email protected]>

kaplun commented Jun 15, 2017

@tsgit are you by chance going to work on this issue in the near future? For the time being we have a workaround, but the approach you outlined in chat sounded way better than a workaround.

jacquerie pushed a commit to kaplun/inspire-next that referenced this issue Jun 15, 2017
Times out the `refextract` task after 300 seconds to work around
inspirehep/refextract#26, which would otherwise block a Celery
worker indefinitely.

Signed-off-by: Samuele Kaplun <[email protected]>
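
The referenced commit is not shown in this thread, so the following is only a sketch of the general idea behind the workaround: run the potentially runaway work in a separate process and kill it once a deadline passes. It uses only the Python 3 standard library rather than Celery. The function name `run_python_with_timeout` and the snippets passed to it are hypothetical.

```python
import subprocess
import sys


def run_python_with_timeout(code, timeout_seconds):
    """Run `code` in a fresh interpreter; return its stdout, or None on timeout.

    subprocess.run kills the child and waits for it when the timeout expires,
    so a hung regex match cannot block the caller indefinitely.
    """
    try:
        completed = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True,
            text=True,
            timeout=timeout_seconds,
            check=True,
        )
    except subprocess.TimeoutExpired:
        return None
    return completed.stdout


if __name__ == "__main__":
    # A busy loop stands in for the hanging extraction call.
    print(run_python_with_timeout("while True: pass", 1.0))
    # A well-behaved snippet returns its output normally.
    print(run_python_with_timeout("print('done')", 30.0))
```

A timeout like this only frees the worker. The reference that triggered it is still left unprocessed, which is why the regex itself needs fixing.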

tsgit commented Jun 15, 2017

@kaplun Yes, this is very high on my to-do list. Unfortunately it got pushed back by AAHEP, vacation, surgery and some other business -- by next week!
