Error logs #8

Open
anjakefala opened this issue Aug 25, 2022 · 5 comments

@anjakefala
Contributor

anjakefala commented Aug 25, 2022

Fixed

title.principals.tsv.gz seems to have been momentarily corrupted. I made a PR (#10) that adds a try/except, so that at least the other tables get built.

anja@allura:git/readysetdata ‹dougb_wpsummaries*›$ time make imdb
scripts/imdb.py -o output
4106s  76.09/408.01MB  (0.02 MB/s)  title.principals.tsv.gz
Traceback (most recent call last):
  File "/home/anja/git/readysetdata/scripts/imdb.py", line 15, in <module>
    output_imdb('principals', 'title.principals.tsv.gz')
  File "/home/anja/git/readysetdata/scripts/imdb.py", line 9, in output_imdb
    rsd.output('imdb', tblname, rsd.parse_tsv(rsd.gunzip(fp)))
  File "/home/anja/git/readysetdata/readysetdata/output.py", line 20, in output
    with OutputTable(dbname, tblname) as out:
  File "/home/anja/git/readysetdata/readysetdata/utils.py", line 131, in parse_asv
    for line in Progress(it):
  File "/home/anja/git/readysetdata/readysetdata/utils.py", line 71, in __iter__
    for i, x in enumerate(self.iterator):
  File "/home/anja/miniconda3/envs/deluxedata/lib/python3.9/gzip.py", line 313, in read1
    return self._buffer.read1(size)
  File "/home/anja/miniconda3/envs/deluxedata/lib/python3.9/_compression.py", line 68, in readinto
    data = self.read(len(byte_view))
  File "/home/anja/miniconda3/envs/deluxedata/lib/python3.9/gzip.py", line 506, in read
    raise EOFError("Compressed file ended before the "
EOFError: Compressed file ended before the end-of-stream marker was reached
make: *** [Makefile:35: imdb] Error 1
make imdb  1959.46s user 551.53s system 19% cpu 3:37:23.65 total

Edit: title.principals.tsv.gz unzipped fine with gzip.
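
For reference, a minimal sketch of the kind of try/except guard described above. This is not the contents of PR #10; `build_tables` and `build_one` are hypothetical names, and the only assumption is that each table is built by one call that may raise partway through a truncated download.

```python
import sys

def build_tables(tables, build_one):
    """Build each (tblname, filename) pair; return the names of tables that failed."""
    failed = []
    for tblname, fn in tables:
        try:
            build_one(tblname, fn)
        except (EOFError, OSError) as e:
            # A truncated or corrupted download should not stop the other tables.
            print(f"skipping {tblname} ({fn}): {e!r}", file=sys.stderr)
            failed.append(tblname)
    return failed
```

Returning the failed names (instead of swallowing the error silently) lets the caller still exit non-zero, so `make` notices that one table is missing.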

anjakefala changed the title from "make imdb error log" to "Error logs" on Aug 25, 2022
@anjakefala
Contributor Author

anja@allura:git/readysetdata ‹dougb_wpsummaries*›$ time make wikidata      
OUTDIR=output/wikidata scripts/wikidata.sh
[6041.3s] 1180688KilledMB  (0.18 MB/s)  latest-all.json.bz2
Traceback (most recent call last):
  File "/home/anja/git/readysetdata/scripts/download.py", line 11, in <module>
    sys.stdout.buffer.write(r)
BrokenPipeError: [Errno 32] Broken pipe
make: *** [Makefile:26: wikidata] Error 137
make wikidata  2253.43s user 305.53s system 42% cpu 1:40:47.28 total
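
A note on reading this log: by shell convention an exit status above 128 means "terminated by signal (status - 128)", so `Error 137` corresponds to signal 9, which matches the `Killed` in the progress line. The log does not say what sent the signal (the kernel's out-of-memory killer is one common cause, but that is a guess here). The `BrokenPipeError` in download.py is what a writer sees once the process reading its pipe has gone away. A small sketch for decoding such statuses; `describe_exit_status` is a hypothetical helper, not part of readysetdata:

```python
import signal

def describe_exit_status(status):
    """Explain a shell exit status, decoding the 128+N convention for signals."""
    if status > 128:
        try:
            return f"killed by {signal.Signals(status - 128).name}"
        except ValueError:
            pass  # not a signal number known on this platform
    return f"exited with status {status}"

print(describe_exit_status(137))  # on Linux: killed by SIGKILL
```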

@anjakefala
Contributor Author

The download URL and the structure of the zip have changed.

New URL: https://geonames.nga.mil/geonames/GNSData/fc_files/Whole_World.7z

anja@allura:git/readysetdata ‹dougb_wpsummaries*›$ scripts/geonames-nonus.py -o output
Traceback (most recent call last):
  File "/home/anja/miniconda3/envs/deluxedata/lib/python3.9/site-packages/urllib3/connectionpool.py", line 703, in urlopen
    httplib_response = self._make_request(
  File "/home/anja/miniconda3/envs/deluxedata/lib/python3.9/site-packages/urllib3/connectionpool.py", line 386, in _make_request
    self._validate_conn(conn)
  File "/home/anja/miniconda3/envs/deluxedata/lib/python3.9/site-packages/urllib3/connectionpool.py", line 1040, in _validate_conn
    conn.connect()
  File "/home/anja/miniconda3/envs/deluxedata/lib/python3.9/site-packages/urllib3/connection.py", line 414, in connect
    self.sock = ssl_wrap_socket(
  File "/home/anja/miniconda3/envs/deluxedata/lib/python3.9/site-packages/urllib3/util/ssl_.py", line 449, in ssl_wrap_socket
    ssl_sock = _ssl_wrap_socket_impl(
  File "/home/anja/miniconda3/envs/deluxedata/lib/python3.9/site-packages/urllib3/util/ssl_.py", line 493, in _ssl_wrap_socket_impl
    return ssl_context.wrap_socket(sock, server_hostname=server_hostname)
  File "/home/anja/miniconda3/envs/deluxedata/lib/python3.9/ssl.py", line 500, in wrap_socket
    return self.sslsocket_class._create(
  File "/home/anja/miniconda3/envs/deluxedata/lib/python3.9/ssl.py", line 1040, in _create
    self.do_handshake()
  File "/home/anja/miniconda3/envs/deluxedata/lib/python3.9/ssl.py", line 1309, in do_handshake
    self._sslobj.do_handshake()
ssl.SSLCertVerificationError: [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1129)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/anja/git/readysetdata/scripts/geonames-nonus.py", line 31, in <module>
    } for r in parse_asv(unzip_url(URL).open_text('Countries.txt'))))
  File "/home/anja/git/readysetdata/readysetdata/http_unzip.py", line 101, in open_text
    return io.TextIOWrapper(io.BufferedReader(self.open(fn)))
  File "/home/anja/git/readysetdata/readysetdata/http_unzip.py", line 81, in open
    f = list(self.matching_files(fn))
  File "/home/anja/git/readysetdata/readysetdata/http_unzip.py", line 75, in matching_files
    for f in self.files.values():
  File "/home/anja/git/readysetdata/readysetdata/http_unzip.py", line 41, in files
    return {r.filename:r for r in self.infolist()}
  File "/home/anja/git/readysetdata/readysetdata/http_unzip.py", line 41, in <dictcomp>
    return {r.filename:r for r in self.infolist()}
  File "/home/anja/git/readysetdata/readysetdata/http_unzip.py", line 44, in infolist
    resp = self.http.request('HEAD', self.url)
  File "/home/anja/miniconda3/envs/deluxedata/lib/python3.9/site-packages/urllib3/request.py", line 74, in request
    return self.request_encode_url(
  File "/home/anja/miniconda3/envs/deluxedata/lib/python3.9/site-packages/urllib3/request.py", line 96, in request_encode_url
    return self.urlopen(method, url, **extra_kw)
  File "/home/anja/miniconda3/envs/deluxedata/lib/python3.9/site-packages/urllib3/poolmanager.py", line 376, in urlopen
    response = conn.urlopen(method, u.request_uri, **kw)
  File "/home/anja/miniconda3/envs/deluxedata/lib/python3.9/site-packages/urllib3/connectionpool.py", line 813, in urlopen
    return self.urlopen(
  File "/home/anja/miniconda3/envs/deluxedata/lib/python3.9/site-packages/urllib3/connectionpool.py", line 813, in urlopen
    return self.urlopen(
  File "/home/anja/miniconda3/envs/deluxedata/lib/python3.9/site-packages/urllib3/connectionpool.py", line 813, in urlopen
    return self.urlopen(
  File "/home/anja/miniconda3/envs/deluxedata/lib/python3.9/site-packages/urllib3/connectionpool.py", line 785, in urlopen
    retries = retries.increment(
  File "/home/anja/miniconda3/envs/deluxedata/lib/python3.9/site-packages/urllib3/util/retry.py", line 592, in increment
    raise MaxRetryError(_pool, url, error or ResponseError(cause))
urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='geonames.nga.mil', port=443): Max retries exceeded with url: /gns/html/cntyfile/geonames_20220606.zip (Caused by SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1129)')))
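
Separately from the URL change (the request above still goes to the old `/gns/html/cntyfile/geonames_20220606.zip` path), `unable to get local issuer certificate` generally means that some issuer in the server's certificate chain is not in the local trust store. A hedged, stdlib-only sketch for checking where this Python looks for CA certificates and for building a verifying context that trusts an extra CA bundle. `trust_store_report` and `make_context` are hypothetical helpers; the script itself uses urllib3, whose `PoolManager` accepts a `ca_certs` argument for the same purpose.

```python
import os
import ssl

def trust_store_report():
    """Report where this Python looks for CA certificates and whether the paths exist."""
    paths = ssl.get_default_verify_paths()
    return {
        "cafile": paths.cafile,
        "capath": paths.capath,
        "capath_exists": bool(paths.capath and os.path.isdir(paths.capath)),
        "openssl_cafile": paths.openssl_cafile,
    }

def make_context(ca_bundle=None):
    """Build a verifying TLS context, optionally trusting an extra PEM bundle."""
    ctx = ssl.create_default_context()
    if ca_bundle:
        # Assumption: ca_bundle is a PEM file containing the missing issuer(s).
        ctx.load_verify_locations(cafile=ca_bundle)
    return ctx

if __name__ == "__main__":
    for key, value in trust_store_report().items():
        print(f"{key}: {value}")
```

Verification stays on in both cases; the sketch only changes which CAs are trusted.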

@anjakefala
Contributor Author

make movielens

(It completes successfully, but raises this one exception near the end.)

453s  6.77/125.89MB  (0.01 MB/s)  movie_dataset_public_final/raw/ratings.json

Traceback (most recent call last):
  File "/home/anja/git/readysetdata/readysetdata/output.py", line 24, in output
    r = next(it)
  File "/home/anja/git/readysetdata/scripts/movielens.py", line 48, in <genexpr>
    output('movielens', 'ratings', ({
  File "/home/anja/git/readysetdata/readysetdata/utils.py", line 147, in __iter__
    yield AttrDict(json.loads(line))
  File "/home/anja/miniconda3/envs/deluxedata/lib/python3.9/json/__init__.py", line 346, in loads
    return _default_decoder.decode(s)
  File "/home/anja/miniconda3/envs/deluxedata/lib/python3.9/json/decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/home/anja/miniconda3/envs/deluxedata/lib/python3.9/json/decoder.py", line 353, in raw_decode
    obj, end = self.scan_once(s, idx)
json.decoder.JSONDecodeError: Expecting ',' delimiter: line 1 column 28 (char 27)
None

0s  0.00/0.36MB  (0.00 MB/s)  movie_dataset_public_final/raw/survey_answers.json
[12.0s] 42100
12s  0.26/0.36MB  (0.02 MB/s)  movie_dataset_public_final/raw/survey_answers.json
[16.5s] 58500
17s  0.36/0.36MB  (0.02 MB/s)  movie_dataset_public_final/raw/survey_answers.json
[16.6s] 58900
17s  0.36/0.36MB  (0.02 MB/s)  movie_dataset_public_final/raw/survey_answers.json


17s  0.36/0.36MB  (0.02 MB/s)  movie_dataset_public_final/raw/survey_answers.json
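
The `JSONDecodeError` above comes from a single line of ratings.json that is not valid JSON. A minimal sketch of a line-by-line parser that reports such lines and keeps going, so the bad rows are counted instead of being dropped silently. `parse_jsonl_lenient` is a hypothetical name, not the parser in readysetdata/utils.py.

```python
import json
import sys

def parse_jsonl_lenient(lines, max_reported=5):
    """Yield one parsed object per line, skipping lines that are not valid JSON."""
    bad = 0
    for lineno, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue
        try:
            obj = json.loads(line)
        except json.JSONDecodeError as e:
            bad += 1
            if bad <= max_reported:
                # Show only the start of the line to keep the log readable.
                print(f"line {lineno}: {e}: {line[:60]!r}", file=sys.stderr)
            continue
        yield obj
    if bad:
        print(f"skipped {bad} malformed line(s)", file=sys.stderr)
```

Usage would look like `for row in parse_jsonl_lenient(fp): ...` for any iterable of text lines.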

@anjakefala
Contributor Author

anjakefala commented Aug 25, 2022

Fixed

make wikipedia

  File "/home/anja/git/readysetdata/scripts/parse-wikipedia.py", line 15, in <module>
  File "/home/anja/git/readysetdata/readysetdata/output.py", line 16, in outputSingle
  File "/home/anja/git/readysetdata/readysetdata/output.py", line 98, in output
  File "/home/anja/git/readysetdata/readysetdata/output.py", line 99, in <listcomp>
  File "/home/anja/git/readysetdata/readysetdata/jsonl.py", line 29, in output_jsonl
  File "/home/anja/git/readysetdata/readysetdata/jsonl.py", line 9, in __init__
OSError: [Errno 24] Too many open files: 'output/wikipedia_infoboxes/hot_spring.jsonl'
Traceback (most recent call last):
  File "/home/anja/git/readysetdata/scripts/xml2jsonl.py", line 58, in <module>
    main()
  File "/home/anja/git/readysetdata/scripts/xml2jsonl.py", line 55, in main
    rdr.parse(sys.stdin)
  File "/home/anja/miniconda3/envs/deluxedata/lib/python3.9/xml/sax/expatreader.py", line 111, in parse
    xmlreader.IncrementalParser.parse(self, source)
  File "/home/anja/miniconda3/envs/deluxedata/lib/python3.9/xml/sax/xmlreader.py", line 125, in parse
    self.feed(buffer)
  File "/home/anja/miniconda3/envs/deluxedata/lib/python3.9/xml/sax/expatreader.py", line 217, in feed
    self._parser.Parse(data, isFinal)
  File "/opt/conda/conda-bld/python-split_1654083059479/work/Modules/pyexpat.c", line 461, in EndElement
  File "/home/anja/miniconda3/envs/deluxedata/lib/python3.9/xml/sax/expatreader.py", line 336, in end_element
    self._cont_handler.endElement(name)
  File "/home/anja/git/readysetdata/scripts/xml2jsonl.py", line 44, in endElement
    print(json.dumps(simplify(contents)), file=self.fp)
BrokenPipeError: [Errno 32] Broken pipe
Traceback (most recent call last):
  File "/home/anja/git/readysetdata/scripts/download.py", line 11, in <module>
    sys.stdout.buffer.write(r)
BrokenPipeError: [Errno 32] Broken pipe
make: *** [Makefile:21: wikipedia] Error 1
make wikipedia  3230.65s user 17.81s system 106% cpu 50:46.76 total
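
The first failure in this log is `[Errno 24] Too many open files`, raised while opening one .jsonl output per infobox type; the two `BrokenPipeError`s follow once that process has exited. One way around the limit, besides raising the soft limit with `ulimit -n`, is to cap how many output files are open at once. A minimal sketch under that assumption; `BoundedAppendFiles` is a hypothetical class and not how readysetdata/jsonl.py works.

```python
from collections import OrderedDict

class BoundedAppendFiles:
    """Write to many files while keeping at most `max_open` handles open."""

    def __init__(self, max_open=64):
        self.max_open = max_open
        self._open = OrderedDict()  # path -> file object, least recently used first

    def write(self, path, text):
        fp = self._open.pop(path, None)
        if fp is None:
            if len(self._open) >= self.max_open:
                # Evict the least recently used handle to stay under the cap.
                _, oldest = self._open.popitem(last=False)
                oldest.close()
            # Append mode, so reopening an evicted file keeps its earlier rows.
            fp = open(path, "a", encoding="utf-8")
        self._open[path] = fp  # re-insert as most recently used
        fp.write(text)

    def close(self):
        while self._open:
            _, fp = self._open.popitem()
            fp.close()

    def __enter__(self):
        return self

    def __exit__(self, *exc):
        self.close()
```

Because files are opened in append mode, output left over from a previous run would need to be removed first.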

saulpw added a commit that referenced this issue Aug 28, 2022
otherwise "[Errno 24] Too many open files"
@anjakefala
Contributor Author

make wikipedia

3393s  482.54/21132.09MB  (0.14 MB/s)  enwiki-latest-pages-articles-multistream.xml.bz2
bunzip2: Compressed file ends unexpectedly;
        perhaps it is corrupted?  *Possible* reason follows.
bunzip2: Inappropriate ioctl for device
        Input file = (stdin), output file = (stdout)

It is possible that the compressed file(s) have become corrupted.
You can use the -tvv option to test integrity of such files.

You can use the `bzip2recover' program to attempt to recover
data from undamaged sections of corrupted files.

[3392.4s] 66704Traceback (most recent call last):
  File "/home/anja/miniconda3/envs/deluxedata/lib/python3.9/xml/sax/expatreader.py", line 217, in feed
    self._parser.Parse(data, isFinal)
xml.parsers.expat.ExpatError: no element found: line 13647185, column 1107

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/anja/git/readysetdata/scripts/xml2jsonl.py", line 58, in <module>
    main()
  File "/home/anja/git/readysetdata/scripts/xml2jsonl.py", line 55, in main
    rdr.parse(sys.stdin)
  File "/home/anja/miniconda3/envs/deluxedata/lib/python3.9/xml/sax/expatreader.py", line 111, in parse
    xmlreader.IncrementalParser.parse(self, source)
  File "/home/anja/miniconda3/envs/deluxedata/lib/python3.9/xml/sax/xmlreader.py", line 127, in parse
    self.close()
  File "/home/anja/miniconda3/envs/deluxedata/lib/python3.9/xml/sax/expatreader.py", line 240, in close
    self.feed(b"", isFinal=True)
  File "/home/anja/miniconda3/envs/deluxedata/lib/python3.9/xml/sax/expatreader.py", line 221, in feed
    self._err_handler.fatalError(exc)
  File "/home/anja/miniconda3/envs/deluxedata/lib/python3.9/xml/sax/handler.py", line 38, in fatalError
    raise exception
xml.sax._exceptions.SAXParseException: <stdin>:13647185:1107: no element found
cd output/wikipedia-infoboxes && zip -n .arrow ../wikipedia-infoboxes.zip *.jsonl
/bin/sh: 1: cd: can't cd to output/wikipedia-infoboxes
make: *** [Makefile:22: wikipedia] Error 2
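
Here bunzip2 hit the end of its input after about 482 MB of the 21132 MB dump, and the SAX `no element found` error is the XML parser reaching that same cut-off. If the dump is saved to disk first, it can be checked before parsing. A minimal sketch, assuming a local file; `bz2_is_complete` is a hypothetical helper, and it handles multistream files like this one by starting a new decompressor after each stream ends.

```python
import bz2

def bz2_is_complete(path, chunk_size=64 * 1024):
    """Return True if every bz2 stream in the file reaches its end-of-stream marker."""
    decomp = bz2.BZ2Decompressor()
    with open(path, "rb") as fp:
        while True:
            chunk = fp.read(chunk_size)
            if not chunk:
                break
            while chunk:
                if decomp.eof:
                    # Previous stream finished; the remaining bytes start a new one.
                    decomp = bz2.BZ2Decompressor()
                try:
                    decomp.decompress(chunk)  # output is discarded, only state matters
                except OSError:
                    return False  # invalid data
                chunk = decomp.unused_data if decomp.eof else b""
    # A truncated file leaves the last decompressor waiting for more input.
    return decomp.eof
```

This is roughly what `bzip2 -t` checks, done in Python so that it can run as a step before the parser.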

anjakefala mentioned this issue Aug 28, 2022
Merged