Spaces in WARC filenames #576

Open
lwrubel opened this issue Oct 10, 2022 — with Honeybadger · 1 comment

lwrubel commented Oct 10, 2022

Backtrace

line 8 of [PROJECT_ROOT]/lib/dor/was_crawl/dissemination/utilities.rb: run_sys_cmd
line 38 of [PROJECT_ROOT]/lib/dor/was_crawl/cdxj_generator_service.rb: generate_cdx_for_one_warc
line 14 of [PROJECT_ROOT]/lib/dor/was_crawl/cdxj_generator_service.rb: block in generate

lwrubel commented Oct 10, 2022

It's possible to register a file with a space in its filename via the WAS Registrar App. When the cdxj-indexer step tries to index the file, the command-line invocation fails:

TMPDIR=/tmp /opt/app/was/.local/bin/poetry run cdxj-indexer /web-archiving-stacks/data/collections/sr233xh9483/kc/724/zz/3676/facebook-peter-chan-20220930184009658-60fd4c1f-414-0 (1)-rec-20220928220819708876-crawl-manual-20220928220757-60fd4c1f-414-0.warc.gz --output /web-archiving-stacks/data/indexes/cdxj_working/druid:kc724zz3676/facebook-peter-chan-20220930184009658-60fd4c1f-414-0 (1)-rec-20220928220819708876-crawl-manual-202...

This could be handled here, and possibly also in WAS-Registrar-App by not allowing uploads of files with spaces in their names.
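
One possible way to handle it here is sketched below. It assumes the command is assembled as a single string and passed to a shell, which I have not confirmed from `run_sys_cmd`. `build_cdxj_command` and the example paths are hypothetical; only `cdxj-indexer` and its `--output` option come from the failing command above. `Shellwords.escape` is from the Ruby standard library.

```ruby
require 'shellwords'

# Hypothetical helper (not the existing code): builds the indexer command
# with each path shell-escaped, so spaces and parentheses stay inside one argument.
def build_cdxj_command(warc_path, output_path)
  "cdxj-indexer #{Shellwords.escape(warc_path)} --output #{Shellwords.escape(output_path)}"
end

# Made-up paths that mimic the "name (1)" pattern from the failing file.
warc = '/data/collections/example-0 (1)-rec.warc.gz'
out  = '/data/indexes/example-0 (1)-rec.cdxj'

cmd = build_cdxj_command(warc, out)
puts cmd
```

Passing the arguments as an array to `Open3.capture3` or `system` would avoid the shell entirely, but that may not fit with the `TMPDIR=/tmp` prefix and `poetry run` wrapper as they are currently written.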

@edsu edsu changed the title [was_robot_suite/stage] RuntimeError: Error in extracting CDXJ with command: TMPDIR=/tmp /opt/app/was/.local/bin/poetry run cdxj-indexer /web-archiving-stacks/data/collections/sr233xh9483/kc/724/zz/3676/facebook-peter-chan-20220930184009658-60fd4c1f-414-0 (1)-rec-20220928220819708876-crawl-manual-20220928220757-60fd4c1f-414-0.warc.gz --output /web-archiving-stacks/data/indexes/cdxj_working/druid:kc724zz3676/facebook-peter-chan-20220930184009658-60fd4c1f-414-0 (1)-rec-20220928220819708876-crawl-manual-202... Spaces in WARC files Dec 2, 2022
@edsu edsu changed the title Spaces in WARC files Spaces in WARC filenames Dec 2, 2022