line 8 of [PROJECT_ROOT]/lib/dor/was_crawl/dissemination/utilities.rb: run_sys_cmd
line 38 of [PROJECT_ROOT]/lib/dor/was_crawl/cdxj_generator_service.rb: generate_cdx_for_one_warc
line 14 of [PROJECT_ROOT]/lib/dor/was_crawl/cdxj_generator_service.rb: block in generate
It's possible to register a file with a space in its filename via the WAS Registrar App. When the cdxj-indexer step tries to index the file, the command-line indexing fails:
TMPDIR=/tmp /opt/app/was/.local/bin/poetry run cdxj-indexer /web-archiving-stacks/data/collections/sr233xh9483/kc/724/zz/3676/facebook-peter-chan-20220930184009658-60fd4c1f-414-0 (1)-rec-20220928220819708876-crawl-manual-20220928220757-60fd4c1f-414-0.warc.gz --output /web-archiving-stacks/data/indexes/cdxj_working/druid:kc724zz3676/facebook-peter-chan-20220930184009658-60fd4c1f-414-0 (1)-rec-20220928220819708876-crawl-manual-202...
This could be handled here, and possibly also in WAS-Registrar-App by not allowing files with spaces in their names to be uploaded.
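One way to handle it on this side might be to shell-escape each path before the command string is built. The following is only a minimal sketch: `build_cdxj_command` is a hypothetical helper, not the actual code in `cdxj_generator_service.rb`, and it assumes the command is passed to a shell as a single string. The paths are made-up examples.

```ruby
require 'shellwords'

# Hypothetical helper (not the real was_robot_suite code): builds the
# cdxj-indexer command line with each path escaped for the shell, so
# spaces and parentheses in a WARC filename stay part of one argument.
def build_cdxj_command(warc_path, output_path)
  [
    'cdxj-indexer',
    Shellwords.escape(warc_path),
    '--output',
    Shellwords.escape(output_path)
  ].join(' ')
end

# Example paths are illustrative only.
cmd = build_cdxj_command(
  '/data/example-0 (1)-rec.warc.gz',
  '/data/out/example-0 (1)-rec.cdxj'
)
puts cmd
```

An alternative, if the command does not need shell features, would be to pass the arguments as an array (for example to `Open3.capture3`), which avoids shell parsing entirely.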
edsu changed the title from "[was_robot_suite/stage] RuntimeError: Error in extracting CDXJ with command: TMPDIR=/tmp /opt/app/was/.local/bin/poetry run cdxj-indexer /web-archiving-stacks/data/collections/sr233xh9483/kc/724/zz/3676/facebook-peter-chan-20220930184009658-60fd4c1f-414-0 (1)-rec-20220928220819708876-crawl-manual-20220928220757-60fd4c1f-414-0.warc.gz --output /web-archiving-stacks/data/indexes/cdxj_working/druid:kc724zz3676/facebook-peter-chan-20220930184009658-60fd4c1f-414-0 (1)-rec-20220928220819708876-crawl-manual-202..." to "Spaces in WARC files" on Dec 2, 2022
edsu changed the title from "Spaces in WARC files" to "Spaces in WARC filenames" on Dec 2, 2022
Backtrace
View full backtrace and more info at honeybadger.io