Dealing with broken PMC files without xlink namespace #12

Open
jakelever opened this issue Feb 16, 2022 · 0 comments
jakelever commented Feb 16, 2022

A small number of PMC files use the xlink namespace without defining it first. For example, the documents include "xlink:href" where the "xlink" prefix hasn't been defined. This breaks the XML parser and gives errors like the one below.

Traceback (most recent call last):
  File "/projects/jlever/github/biotext/src/bioconverters/pmcxml.py", line 390, in pmcxml2bioc
    for pmc_doc in process_pmc_file(source, tag_handlers=tag_handlers):
  File "/projects/jlever/github/biotext/src/bioconverters/pmcxml.py", line 274, in process_pmc_file
    for event, elem in etree.iterparse(source, events=("start", "end", "start-ns", "end-ns")):
  File "/home/jlever/.linuxbrew/Cellar/python/3.7.3/lib/python3.7/xml/etree/ElementTree.py", line 1222, in iterator
    yield from pullparser.read_events()
  File "/home/jlever/.linuxbrew/Cellar/python/3.7.3/lib/python3.7/xml/etree/ElementTree.py", line 1297, in read_events
    raise event
  File "/home/jlever/.linuxbrew/Cellar/python/3.7.3/lib/python3.7/xml/etree/ElementTree.py", line 1269, in feed
    self._parser.feed(data)
xml.etree.ElementTree.ParseError: unbound prefix: line 12, column 0

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "src/convertPMC.py", line 56, in <module>
    for bioc_doc in pmcxml2bioc(io.StringIO(data)):
  File "/projects/jlever/github/biotext/src/bioconverters/pmcxml.py", line 450, in pmcxml2bioc
    raise RuntimeError("Parsing error in PMC xml file: %s" % source)
RuntimeError: Parsing error in PMC xml file: <_io.StringIO object at 0x7f04d1099c18>

An initial hacky fix was implemented in 63663fe and e30c3e9, which tried to fix href-specific cases. This needs to be explored further, as a new file with a non-href-related case has appeared.
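
One possible, more general direction is sketched below. It is not what 63663fe or e30c3e9 do; it is only an illustration. The idea is to declare any known-but-undeclared prefix on the root element before handing the text to the parser. The helper name `declare_missing_namespaces` and the `KNOWN_NAMESPACES` table are hypothetical, and the regex check is a heuristic: it treats a declaration anywhere in the file as sufficient and does not handle element-like text inside comments.

```python
import io
import re
import xml.etree.ElementTree as etree

# Hypothetical table of prefixes that commonly appear in PMC XML.
# The URIs are the standard W3C namespace URIs for XLink and MathML.
KNOWN_NAMESPACES = {
    "xlink": "http://www.w3.org/1999/xlink",
    "mml": "http://www.w3.org/1998/Math/MathML",
}


def declare_missing_namespaces(xml_text, known=KNOWN_NAMESPACES):
    """Add xmlns declarations to the root element for used-but-undeclared prefixes."""
    missing = [
        (prefix, uri)
        for prefix, uri in known.items()
        # Prefix is used on an element or attribute name...
        if re.search(r"[<\s]%s:" % re.escape(prefix), xml_text)
        # ...but never declared anywhere in the document (heuristic).
        and not re.search(r"xmlns:%s\s*=" % re.escape(prefix), xml_text)
    ]
    if not missing:
        return xml_text

    declarations = "".join(' xmlns:%s="%s"' % item for item in missing)
    # The first "<name" is the root start tag: "<?xml" and "<!DOCTYPE"
    # do not match because "<" is not followed by a name character.
    return re.sub(
        r"<[A-Za-z_][\w.:\-]*",
        lambda m: m.group(0) + declarations,
        xml_text,
        count=1,
    )


# Usage: repair the text first, then parse it the same way as before.
data = '<article><ext-link xlink:href="http://example.org">x</ext-link></article>'
source = io.StringIO(declare_missing_namespaces(data))
for event, elem in etree.iterparse(source, events=("start", "end")):
    if event == "end" and elem.tag == "ext-link":
        print(elem.get("{http://www.w3.org/1999/xlink}href"))
```

Because this works on prefixes rather than on "href" specifically, it should also cover non-href uses of an undeclared prefix, as long as the prefix is in the table. Prefixes that are not in the table would still raise the "unbound prefix" error.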

@jakelever jakelever self-assigned this Feb 16, 2022
@jakelever jakelever changed the title Dealing with broken PMC files without xref namespace Dealing with broken PMC files without xlink namespace Feb 18, 2022