open_parquet + remote server #25
Comments
Also including the yaml entry for the same:
big_parquet:
  description: A big parquet file
  driver: parquet
  args:
    urlpath: '/mnt/datafiles/big_parquet.parquet'
Whether you get a "remote" dataset (RemoteDataFrame) or a "local" one (ParquetSource) depends on the value of direct_access on the server. If it is "forbid", access is only via the server, and the client does not get to make any choices; this is the situation you are seeing. The intent is that the server hides the origin of the data. Use "allow" to have the client open the data directly.
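To make the rule above concrete, here is a minimal sketch in plain Python. It is a toy model of the decision, not intake's actual implementation: the function name `resolve_container` and the `client_has_driver` flag are hypothetical, and the fallback to a remote source when the client lacks the driver is an assumption that is not stated in this thread.

```python
def resolve_container(direct_access: str, client_has_driver: bool = True) -> str:
    """Toy model: does the client end up with a 'remote' or a 'local' source?

    direct_access     -- the server-side setting discussed above
    client_has_driver -- hypothetical flag: can the client load the driver itself?
    """
    if direct_access == "forbid":
        # Access goes only through the server; the client makes no choices.
        return "remote"
    if direct_access == "allow":
        # The client may open the data directly.
        # Assumption: without the driver it falls back to the server.
        return "local" if client_has_driver else "remote"
    raise ValueError(f"setting not covered by this sketch: {direct_access!r}")


for setting in ("forbid", "allow"):
    print(setting, "->", resolve_container(setting))
```

With "forbid" the result is always "remote", which matches the RemoteDataFrame reported in this issue.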
No, this is not intended API - the parquet source will attempt to "open" the given variable, which should be one or more paths.
Thank you Martin, that clears up quite a bit. I like the idea of having the remote server handle the parquet querying. Passing the
>>> import intake
>>> intake.__version__
'0.6.3'
>>> cat = intake.open_catalog('intake://localhost:5555')
>>> type(cat.big_parquet)
intake.container.dataframe.RemoteDataFrame
>>> # Works!
>>> cat.big_parquet.get(columns=['Column 1']).to_dask()
However, when I try the
>>> # Doesn't work
>>> cat.big_parquet.get(columns=['Column 1'], filters=[('YEAR', '==', 2020)]).to_dask()
...
/conda/envs/dev/lib/python3.8/site-packages/intake/catalog/remote.py in get(self, **user_parameters)
448 http_args['headers'] = self.http_args['headers'].copy()
449 http_args['headers'].update(self.auth.get_headers())
--> 450 return open_remote(
451 self.url, self.name, container=self.container,
452 user_parameters=user_parameters, description=self.description,
/conda/envs/dev/lib/python3.8/site-packages/intake/catalog/remote.py in open_remote(url, entry, container, user_parameters, description, http_args, page_size, auth, getenv, getshell)
507
508 else:
--> 509 raise Exception('Server error: %d, %s' % (req.status_code, req.reason))
Exception: Server error: 400, too many values to unpack (expected 3)
Actually, I'm surprised that the columns= kwarg worked either. The idea of remote sources (with direct access forbidden) is that the client should know nothing about how the source is loaded on the server, not even that it happens to be parquet. If you don't know the exact source type, you cannot know what arguments make sense, and this is all supposed to be up to the server to decide. (The traceback shown by the server will give you more detail on exactly where this assumption is enforced.)
Hi,
This may be a usage question rather than a bug, and I apologize if this is not the right place for this sort of question. I am struggling with how to limit columns on a parquet file in an intake server. Based on the docs and the example notebook, I think this should work:
If I read the file directly without using intake.open_parquet, it works fine, but I am precluded from limiting the columns. Is this the expected behavior? Apologies in advance if I missed it in the docs.