Better handling of wpull exit status codes
Currently the crawl management command always returns a zero exit code
even if wpull has some kind of serious error. By default wpull returns
a non-zero exit code for *any* failure, which is too sensitive - we
don't want downstream processing to fail just because the crawler
can't resolve the DNS of a link, for example.

This change reintroduces the wpull exit status codes but only for
errors that don't relate to network failures (DNS, connectivity, etc).
chosak committed Nov 6, 2023
1 parent 22f2b32 commit b36fa5b
Showing 2 changed files with 25 additions and 1 deletion.
2 changes: 1 addition & 1 deletion crawler/management/commands/crawl.py
@@ -79,4 +79,4 @@ def command(start_url, db_filename, max_pages, depth, recreate, resume):
     # https://docs.djangoproject.com/en/3.2/topics/async/#async-safety
     os.environ["DJANGO_ALLOW_ASYNC_UNSAFE"] = "true"

-    app.run_sync()
+    return app.run_sync()
24 changes: 24 additions & 0 deletions crawler/wpull_plugin.py
@@ -9,6 +9,7 @@

 from wpull.application.hook import Actions
 from wpull.application.plugin import PluginFunctions, WpullPlugin, hook
+from wpull.errors import ExitStatus
 from wpull.network.connection import BaseConnection
 from wpull.pipeline.item import URLProperties
 from wpull.url import URLInfo
@@ -271,3 +272,26 @@ def process_200_response(self, request, response):

         html = response.body.content().decode("utf-8")
         return Page.from_html(request.url, html, self.start_url.hostname)
+
+    @hook(PluginFunctions.exit_status)
+    def exit_status(self, app_session, exit_code):
+        # If a non-zero exit code exists because of some kind of network error
+        # (DNS resolution, connection issue, etc.) we want to ignore it and
+        # instead return a zero error code. We expect to encounter some of
+        # these errors when we crawl, but we don't want the overall process to
+        # fail downstream processing.
+        #
+        # See list of wpull exit status codes here:
+        # https://github.com/ArchiveTeam/wpull/blob/v2.0.1/wpull/errors.py#L40-L63
+        return (
+            0
+            if exit_code
+            in (
+                ExitStatus.network_failure,
+                ExitStatus.ssl_verification_error,
+                ExitStatus.authentication_failure,
+                ExitStatus.protocol_error,
+                ExitStatus.server_error,
+            )
+            else exit_code
+        )
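
For readers who want to try the filtering rule without installing wpull, here is a minimal standalone sketch of the same decision. The `ExitStatus` enum below is a stand-in written for illustration, not wpull's class, and its numeric values are assumptions; the authoritative list is in the `errors.py` file linked in the comment above. `IGNORED_STATUSES` and `filter_exit_code` are hypothetical names that do not appear in the commit.

```python
from enum import IntEnum


class ExitStatus(IntEnum):
    """Stand-in for wpull.errors.ExitStatus (numeric values are assumed)."""

    generic_error = 1
    parser_error = 2
    file_io_error = 3
    network_failure = 4
    ssl_verification_error = 5
    authentication_failure = 6
    protocol_error = 7
    server_error = 8


# Statuses the hook treats as expected during a crawl.
IGNORED_STATUSES = frozenset(
    {
        ExitStatus.network_failure,
        ExitStatus.ssl_verification_error,
        ExitStatus.authentication_failure,
        ExitStatus.protocol_error,
        ExitStatus.server_error,
    }
)


def filter_exit_code(exit_code):
    """Return 0 for ignored statuses; pass every other code through unchanged."""
    return 0 if exit_code in IGNORED_STATUSES else exit_code


if __name__ == "__main__":
    for status in ExitStatus:
        print(f"{status.name}: {int(status)} -> {int(filter_exit_code(status))}")
```

Under these assumptions, generic, parser, and file I/O errors still surface as non-zero exit codes, while the five network-related statuses collapse to zero.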
