New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

Sign up for GitHub

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Jump to bottom

Add XMLParser( recover=True ) option to pyoai/client; add --recover option to harvest. #31

Open

sdm7g wants to merge 15 commits into bloomonkey:develop from sdm7g:fix-pyoai

sdm7g commented Apr 19, 2019

This adds XMLParser( recover=True ) option to pyoai/client and includes that module locally. #25
( I've had an outstanding pull request to fix code in that project. )
Adds an optional --recover / --no-recover option to command line args.
( default is --no-recover, to be conservative and keep the same behavior )
Adds code to log errors as warnings when using recover option, and to log which identifiers had parser errors, so that upstream feeds can be notified.
And fixed some issues with logger config that cropped up when testing this code:
harvest.py logging now goes to harvest.log, registry.py goes to registry.log #26 ;
and added some missing whitespace in help strings.

Steve Majewski added 9 commits

April 12, 2019 16:01


          test using (local) updated pyoai module

775b991


          swap streams in basicConfig

d963139


          finally think the logging is working correctly from PyErrorLog subclass


          use local version of client instead of oaipmh version until/unless it…

f2d389b

… gets updated


          only use recover=True parser option if that is passed as keyword to h…

f76e75a

…arvester


          use level symbolic names; add command line option to control recover …

df558d9

…parser


          add help text for recover

e2ae4d4


          check len as well, otherwise mock test were firing off this log message

1018ebb


          add missing whitespace in help strings

51dd080

bloomonkey requested changes

View reviewed changes

Owner

bloomonkey left a comment

Hi

First, thanks for taking the time and trouble to code this up and submit a PR. This looks like a very useful addition.

That said, I'm uneasy about adding this entire module into oaiharvest as this brings with it the burden of maintaining this code (which I didn't write and don't understand). I'd much prefer it if pyoai could approve your pull request and mint a new release... (I'll go and +1 your PR there shortly)

There a few other changes I've suggested to remove code you're replacing rather than commenting it out. The git repo will keep history for us, so there's no need for it to be in the latest version.

Thanks again, and good luck getting the pyoai change merged... 🤞

John

oaiharvest/harvest.py Outdated

                           if isinstance(metadata, str) and metadata.startswith("b'"):
                               metadata = ast.literal_eval(metadata).decode("utf-8")
                           yield (header, metadata, about)
+                          if client.XMLParser.error_log and len(client.XMLParser.error_log) > 0 :

Owner

bloomonkey Apr 26, 2019

Please remove space before : (PEP 8)

oaiharvest/harvest.py Outdated

-              ch = logging.StreamHandler()
+                  format='%(levelname)-8s %(message)s',
+                  # format='%(asctime)s %(name)-16s %(levelname)-8s %(message)s',

Owner

bloomonkey Apr 26, 2019

Please remove old code instead of commenting it out

oaiharvest/registry.py Outdated

-                  datefmt='[%Y-%m-%d %H:%M:%S]',
-                  filename=os.path.join(appdir, 'registry.log')
+                  format='%(levelname)-8s %(message)s'
+                  # format='%(asctime)s %(name)-16s %(levelname)-8s %(message)s',

Owner

bloomonkey Apr 26, 2019

Please remove old code instead of commenting it out

oaiharvest/registry.py Outdated

               )
               logger = logging.getLogger(__name__)
-              ch = logging.StreamHandler()
+              # ch = logging.StreamHandler()

Owner

bloomonkey Apr 26, 2019

Please remove old code instead of commenting it out

oaiharvest/registry.py Outdated

               ch.setLevel(logging.DEBUG)
-              formatter = logging.Formatter('%(levelname)-8s %(message)s')
+              # formatter = logging.Formatter('%(levelname)-8s %(message)s')

Owner

bloomonkey Apr 26, 2019

Please remove old code instead of commenting it out

oaiharvest/harvest.py Outdated

+              formatter = logging.Formatter(
+                  '%(asctime)s %(name)-16s %(levelname)-8s %(message)s',
+                  '[%Y-%m-%d %H:%M:%S]')
+              #formatter = logging.Formatter('%(levelname)-8s %(message)s')

Owner

bloomonkey Apr 26, 2019

Please remove old code instead of commenting it out

Author

sdm7g Apr 26, 2019

OK: commented code removed. ( I had left them in originally, as I was unsure of the intent of the original code, and wanted to make clear what exactly what behavior I was changing. )
I also added the --recover option to the doc-string at the top, which I missed earlier.


          remove commented code, add to doc at top

d3d020e

Author

sdm7g commented Apr 30, 2019

On further use and testing, I have found a couple of cases where it does recover from parsing Mal-formed payload, but where the issue is unmatching tags, the errors cascade outward to the OAI container and cause the resumption token to not be found and parsed correctly, thus the harvest halts. ( At least that my initial diagnosis. ) It still gives plenty of parser warning messages before halting, however, I'ld like to figure out a better way to unambiguously identify this case.
I still consider this behavior better than before, but perhaps it should be documented that --recover could have this side effect.

Steve Majewski added 5 commits

May 2, 2019 15:17


          if recover=True, also need to be less strict with xpaths in OAI wrapper;

57287e6

sometimes mismatched tag problems can propagate outwards to container and resumptionToken not found in exact path.


          otherwise files are being written as us-ascii w numerical character e…

7e73b33

…ntities

use unicode, not utf-8, so that it works under python 2.x as well as 3


          Had been getting SSL certificate problems on VH harvest: SSL:CERTIFIC…

c162d15

…ATE_VERIFY_FAILED

I don't know what needs to be fixed on the server ends (and I have no control over that end)
but I found this fix googling the problem (several fixes suggested)
https://www.howtouselinux.com/post/ssl-certificate_verify_failed-in-python


          add 502:Bad Gateway to expected error codes for proxy timeouts

11ee8b3


          timeout increased due to proxy problems

8920dad

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet