Fix indescribablebeast batch / processing #115

Open
techgique opened this issue Jan 13, 2022 · 2 comments

Comments

@techgique
Member

The nbu_indescribablebeast batch will not ingest; the load crashes with this error:

INFO:core.batch_loader:Assigned page sequence: 2
INFO:core.batch_loader:Saving page. issue date: 1924-10-20 00:00:00, page sequence: 2
ERROR:core.batch_loader:unable to load batch: EOL while scanning string literal (<string>, line 1)
ERROR:core.batch_loader:EOL while scanning string literal (<string>, line 1)
Traceback (most recent call last):
  File "/var/local/www/django/openoni/core/batch_loader.py", line 172, in load_batch
    issue = self._load_issue(mets_url)
  File "/var/local/www/django/openoni/core/batch_loader.py", line 294, in _load_issue
    page = self._load_page(doc, page_div, issue)
  File "/var/local/www/django/openoni/core/batch_loader.py", line 414, in _load_page
    self.process_ocr(page)
  File "/var/local/www/django/openoni/core/batch_loader.py", line 448, in process_ocr
    self.solr.add(**page.solr_doc)
  File "/var/local/www/django/openoni/ENV/lib/python2.7/site-packages/solr/core.py", line 684, in add
    return Solr.add_many(self, [fields], commit=_commit)
  File "/var/local/www/django/openoni/ENV/lib/python2.7/site-packages/solr/core.py", line 325, in wrapper
    content = function(self, *args, **kw)
  File "/var/local/www/django/openoni/ENV/lib/python2.7/site-packages/solr/core.py", line 512, in add_many
    self.__add(lst, doc)
  File "/var/local/www/django/openoni/ENV/lib/python2.7/site-packages/solr/core.py", line 598, in __add
    elem['value'] = escape(unicode(value))
  File "/var/local/www/django/openoni/ENV/lib/python2.7/site-packages/solr/core.py", line 1111, in __setitem__
    tmp = eval(value)
  File "<string>", line 1
    {'"Coolidge Starts
                     ^
SyntaxError: EOL while scanning string literal
WARNING:root:no OcrDump to delete for batch_nbu_indescribablebeast_ver01 (University of Nebraska-Lincoln Libraries, Lincoln, NE)
ERROR:core.management.commands.load_batch:unable to load batch: EOL while scanning string literal (<string>, line 1)
Traceback (most recent call last):
  File "/var/local/www/django/openoni/core/management/commands/load_batch.py", line 43, in handle
    batch = loader.load_batch(batch_path)
  File "/var/local/www/django/openoni/core/batch_loader.py", line 201, in load_batch
    raise BatchLoaderException(msg)
BatchLoaderException: unable to load batch: EOL while scanning string literal (<string>, line 1)
CommandError: Batch load failed. See logs/load_batch_#.log

The batch is available on Chronicling America at https://chroniclingamerica.loc.gov/batches/nbu_indescribablebeast_ver01/ and the page causing the error is at https://chroniclingamerica.loc.gov/lccn/sn84024326/1924-10-20/ed-1/seq-2/

batch_nbu_indescribablebeast_ver01/data/sn84024326/00332899314/1924102001/0359.xml appears to be the file triggering the bug, but I'm not yet certain how to work around it. Still reviewing the related code and how it handles the text.

@techgique
Member Author

Have read through more of the code to understand how OCR text is processed for word coordinates etc.

Relevant section of 0359.xml:

<TextLine ID="LINE1" STYLEREFS="TS16" HEIGHT="349" WIDTH="2285" HPOS="448" VPOS="1548">
<String ID="S1" CONTENT="{&apos;&quot;Coolidge" WC="0.455" CC="5 8 6 7 7 1 5 0 5 7 3" HEIGHT="349" WIDTH="1441" HPOS="448" VPOS="1548"/>
<SP ID="SP1" WIDTH="77" HPOS="1892" VPOS="1568"/>
<String ID="S2" CONTENT="Starts" WC="0.778" CC="4 0 0 0 3 5" HEIGHT="281" WIDTH="761" HPOS="1972" VPOS="1568"/>
</TextLine>
<TextLine ID="LINE2" STYLEREFS="TS16" HEIGHT="441" WIDTH="2681" HPOS="452" VPOS="1948">
<String ID="S3" CONTENT=":" WC="0.222" CC="7" HEIGHT="141" WIDTH="29" HPOS="452" VPOS="2196"/>
<SP ID="SP2" WIDTH="229" HPOS="484" VPOS="2108"/>
<String ID="S4" CONTENT="&quot;letter" WC="0.794" CC="0 4 5 0 3 1 0" HEIGHT="329" WIDTH="1037" HPOS="716" VPOS="1948"/>
<SP ID="SP3" WIDTH="73" HPOS="1756" VPOS="2000"/>
<String ID="S5" CONTENT="Campaijrn" WC="0.568" CC="7 0 0 5 0 6 5 7 5" HEIGHT="389" WIDTH="1301" HPOS="1832" VPOS="2000"/>
</TextLine>
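
A minimal sketch of why this line seems to trip the loader. It assumes, based only on the traceback above, that the Solr client ends up passing the field text to eval(); the names ALTO_LINE and eval_fails are made up for illustration, and the excerpt is trimmed to the attributes that matter. Joining the two String values from LINE1 gives text that starts with an opening brace and an unterminated quote, which Python cannot parse as an expression.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of LINE1 from 0359.xml (layout attributes omitted for brevity).
ALTO_LINE = """<TextLine ID="LINE1">
<String ID="S1" CONTENT="{&apos;&quot;Coolidge"/>
<SP ID="SP1"/>
<String ID="S2" CONTENT="Starts"/>
</TextLine>"""

line = ET.fromstring(ALTO_LINE)

# The XML parser decodes &apos; and &quot; into literal quote characters.
words = [s.get("CONTENT") for s in line.iter("String")]
text = " ".join(words)


def eval_fails(value):
    """Return True if eval() rejects the value with a SyntaxError."""
    try:
        eval(value)
    except SyntaxError:
        return True
    return False


print(repr(text))        # the first line of OCR text for the page
print(eval_fails(text))  # eval() cannot parse the unterminated string
```

The exact SyntaxError message differs between Python versions, but the failure is the same: the single quote after the brace opens a string literal that never closes on that line.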

@techgique
Member Author

techgique commented Jan 20, 2022

Worked around by removing the leading special characters ({&apos;&quot;) from CONTENT="{&apos;&quot;Coolidge" in 0359.xml. Saved the unedited file as 0359.xml.orig so we can try restoring it once we upgrade to Open ONI 1.x. Will keep the issue open until we know whether the Solr library change handles this content.
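
For anyone hitting the same problem, here is a rough sketch of that manual edit done in code. The edit itself was made by hand; the helpers clean_content and clean_first_string are hypothetical, and the rule "strip leading non-alphanumeric characters" is an assumption about what is safe to drop, not something the loader requires. The sample is a trimmed, namespace-free copy of LINE1; a full ALTO file normally declares an XML namespace, so the tag lookups would need adjusting there.

```python
import re
import xml.etree.ElementTree as ET

# Assumption: anything before the first ASCII letter or digit is OCR noise.
LEADING_JUNK = re.compile(r"^[^0-9A-Za-z]+")


def clean_content(content):
    """Strip leading punctuation, but keep tokens that are only punctuation."""
    stripped = LEADING_JUNK.sub("", content)
    return stripped or content


def clean_first_string(text_line_xml):
    """Parse one TextLine and clean the CONTENT of its first String element."""
    line = ET.fromstring(text_line_xml)
    first = line.find("String")
    first.set("CONTENT", clean_content(first.get("CONTENT")))
    return line


SAMPLE = (
    '<TextLine ID="LINE1">'
    '<String ID="S1" CONTENT="{&apos;&quot;Coolidge"/>'
    '<SP ID="SP1"/>'
    '<String ID="S2" CONTENT="Starts"/>'
    "</TextLine>"
)

cleaned = clean_first_string(SAMPLE)
print([s.get("CONTENT") for s in cleaned.iter("String")])
```

Only the first String of the line is touched here, mirroring the single-value edit described above; other tokens such as the ":" in LINE2 are left alone.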
