-
Notifications
You must be signed in to change notification settings - Fork 41
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add XMLParser( recover=True ) option to pyoai/client; add --recover option to harvest. #31
base: develop
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi
First, thanks for taking the time and trouble to code this up and submit a PR. This looks like a very useful addition.
That said, I'm uneasy about adding this entire module into oaiharvest as this brings with it the burden of maintaining this code (which I didn't write and don't understand). I'd much prefer it if pyoai could approve your pull request and mint a new release... (I'll go and +1 your PR there shortly)
There a few other changes I've suggested to remove code you're replacing rather than commenting it out. The git repo will keep history for us, so there's no need for it to be in the latest version.
Thanks again, and good luck getting the pyoai change merged... 🤞
John
oaiharvest/harvest.py
Outdated
@@ -129,6 +129,9 @@ def _listRecords(self, baseUrl, metadataPrefix="oai_dc", **kwargs): | |||
if isinstance(metadata, str) and metadata.startswith("b'"): | |||
metadata = ast.literal_eval(metadata).decode("utf-8") | |||
yield (header, metadata, about) | |||
if client.XMLParser.error_log and len(client.XMLParser.error_log) > 0 : |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please remove space before :
(PEP 8)
oaiharvest/harvest.py
Outdated
|
||
ch = logging.StreamHandler() | ||
format='%(levelname)-8s %(message)s', | ||
# format='%(asctime)s %(name)-16s %(levelname)-8s %(message)s', |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please remove old code instead of commenting it out
oaiharvest/registry.py
Outdated
datefmt='[%Y-%m-%d %H:%M:%S]', | ||
filename=os.path.join(appdir, 'registry.log') | ||
format='%(levelname)-8s %(message)s' | ||
# format='%(asctime)s %(name)-16s %(levelname)-8s %(message)s', |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please remove old code instead of commenting it out
oaiharvest/registry.py
Outdated
) | ||
logger = logging.getLogger(__name__) | ||
ch = logging.StreamHandler() | ||
# ch = logging.StreamHandler() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please remove old code instead of commenting it out
oaiharvest/registry.py
Outdated
ch.setLevel(logging.DEBUG) | ||
formatter = logging.Formatter('%(levelname)-8s %(message)s') | ||
# formatter = logging.Formatter('%(levelname)-8s %(message)s') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please remove old code instead of commenting it out
oaiharvest/harvest.py
Outdated
formatter = logging.Formatter( | ||
'%(asctime)s %(name)-16s %(levelname)-8s %(message)s', | ||
'[%Y-%m-%d %H:%M:%S]') | ||
#formatter = logging.Formatter('%(levelname)-8s %(message)s') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please remove old code instead of commenting it out
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK: commented code removed. ( I had left them in originally, as I was unsure of the intent of the original code, and wanted to make clear what exactly what behavior I was changing. )
I also added the --recover option to the doc-string at the top, which I missed earlier.
On further use and testing, I have found a couple of cases where it does recover from parsing Mal-formed payload, but where the issue is unmatching tags, the errors cascade outward to the OAI container and cause the resumption token to not be found and parsed correctly, thus the harvest halts. ( At least that my initial diagnosis. ) It still gives plenty of parser warning messages before halting, however, I'ld like to figure out a better way to unambiguously identify this case. |
sometimes mismatched tag problems can propagate outwards to container and resumptionToken not found in exact path.
…ntities use unicode, not utf-8, so that it works under python 2.x as well as 3
…ATE_VERIFY_FAILED I don't know what needs to be fixed on the server ends (and I have no control over that end) but I found this fix googling the problem (several fixes suggested) https://www.howtouselinux.com/post/ssl-certificate_verify_failed-in-python
This adds XMLParser( recover=True ) option to pyoai/client and includes that module locally. #25
( I've had an outstanding pull request to fix code in that project. )
Adds an optional --recover / --no-recover option to command line args.
( default is --no-recover, to be conservative and keep the same behavior )
Adds code to log errors as warnings when using recover option, and to log which identifiers had parser errors, so that upstream feeds can be notified.
And fixed some issues with logger config that cropped up when testing this code:
harvest.py logging now goes to harvest.log, registry.py goes to registry.log #26 ;
and added some missing whitespace in help strings.