
Support for new fsspec-style caching #18

Open
ian-r-rose opened this issue Sep 4, 2020 · 3 comments

Comments

@ian-r-rose

As far as I can tell, this driver doesn't support fsspec-style caching at the moment:

import intake_parquet

path = "https://github.com/apache/parquet-testing/raw/master/data/alltypes_plain.parquet"
intake_parquet.ParquetSource(urlpath=path, engine="pyarrow").read()  # works
intake_parquet.ParquetSource(urlpath="simplecache::"+path, engine="pyarrow").read()  # fails

Unsure if this is more appropriate to be fixed at the dask layer or here. Is there a roadmap for rolling out fsspec caching to various drivers?
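
For anyone unfamiliar with the syntax above: "simplecache::" is fsspec's URL-chaining protocol, which puts a whole-file local cache in front of any underlying filesystem. A minimal offline sketch using fsspec's in-memory filesystem (the file name and cache directory here are illustrative; the same chaining applies to http/https URLs like the one above):

```python
import tempfile

import fsspec

# Write a small file to fsspec's in-memory filesystem so the example
# needs no network access.
mem = fsspec.filesystem("memory")
with mem.open("/data.bin", "wb") as f:
    f.write(b"parquet-ish bytes")

# "simplecache::" chains a whole-file cache in front of the target
# filesystem; per-protocol options are passed under the protocol's name.
cache_dir = tempfile.mkdtemp()  # illustrative cache location
with fsspec.open(
    "simplecache::memory://data.bin",
    simplecache={"cache_storage": cache_dir},
) as f:
    print(f.read())
```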

@martindurant
Member

Huff, OK.
I'm not entirely sure why, but the latter ends up doing an extra isdir somewhere, and the link-finding regex can't find any links in the binary response (which is reasonable). The following patch works, but I wonder what would happen if, by chance, the binary data contained a link-like string.

--- a/fsspec/implementations/http.py
+++ b/fsspec/implementations/http.py
@@ -12,8 +12,8 @@ from fsspec.asyn import sync_wrapper, sync, AsyncFileSystem, maybe_sync
 from ..caching import AllBytes

 # https://stackoverflow.com/a/15926317/3821154
-ex = re.compile(r"""<a\s+(?:[^>]*?\s+)?href=(["'])(.*?)\1""")
-ex2 = re.compile(r"""(http[s]?://[-a-zA-Z0-9@:%_+.~#?&/=]+)""")
+ex = re.compile(rb"""<a\s+(?:[^>]*?\s+)?href=(["'])(.*?)\1""")
+ex2 = re.compile(rb"""(http[s]?://[-a-zA-Z0-9@:%_+.~#?&/=]+)""")


 async def get_client():
@@ -98,7 +98,7 @@ class HTTPFileSystem(AsyncFileSystem):
         kw.update(kwargs)
         async with self.session.get(url, **self.kwargs) as r:
             r.raise_for_status()
-            text = await r.text()
+            text = await r.read()
         if self.simple_links:
             links = ex2.findall(text) + ex.findall(text)
         else:
@@ -108,6 +108,7 @@ class HTTPFileSystem(AsyncFileSystem):
         for l in links:
             if isinstance(l, tuple):
                 l = l[1]
+            l = l.decode()
             if l.startswith("/") and len(l) > 1:
                 # absolute URL on this server
                 l = parts.scheme + "://" + parts.netloc + l
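
A standalone sketch of the str-vs-bytes mismatch this patch works around: once the handler uses r.read(), the page content is bytes, so the str-compiled regexes raise TypeError, while bytes-compiled ones match but return bytes that must be decoded (the page content below is made up for illustration):

```python
import re

# Same link-finding pattern as in http.py, compiled once from a str
# pattern and once from a bytes pattern.
ex_str = re.compile(r"""(http[s]?://[-a-zA-Z0-9@:%_+.~#?&/=]+)""")
ex_bytes = re.compile(rb"""(http[s]?://[-a-zA-Z0-9@:%_+.~#?&/=]+)""")

# Stand-in for what r.read() returns: raw bytes, not text.
page = b'<a href="https://example.com/data.parquet">data</a>'

try:
    ex_str.findall(page)
except TypeError:
    print("str pattern cannot scan bytes")  # this branch is taken

# Bytes patterns work, but every match must then be decoded,
# mirroring the added `l = l.decode()` line in the diff above.
links = [l.decode() for l in ex_bytes.findall(page)]
print(links)  # ['https://example.com/data.parquet']
```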

@martindurant
Member

(PS: the parquet driver should not really be calling isdir, but isfile - and in HTTP, and sometimes s3/gcs, the same path can be both)

@zaneselvans
Contributor

Lest anybody else be confused by this issue: fsspec-based simplecache caching does seem to work now, and can be enabled by passing the right storage_options to dask.dataframe.read_parquet(). See @martindurant's comment on an issue of ours over here: catalyst-cooperative/pudl#1496 (comment)
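
A hedged sketch of that pattern (the cache directory is illustrative, and the read itself is left commented out since it needs network access plus dask and pyarrow installed):

```python
# Chain the caching protocol into the URL, and pass its options through
# storage_options under the protocol's name.
path = "https://github.com/apache/parquet-testing/raw/master/data/alltypes_plain.parquet"
url = "simplecache::" + path
storage_options = {"simplecache": {"cache_storage": "/tmp/parquet-cache"}}  # illustrative path

# import dask.dataframe as dd
# df = dd.read_parquet(url, storage_options=storage_options)

print(url.split("::", 1)[0])  # the caching layer fsspec will apply: simplecache
```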
