
open_parquet + remote server #25

Open
jbogaardt opened this issue Sep 16, 2021 · 4 comments

@jbogaardt

Hi,

This may be a usage question rather than a bug, and I apologize if this is not the right place to ask. I am struggling to limit the columns read from a parquet file served by an intake server. Based on the docs and the example notebook, I thought this would work:

>>> import intake
>>> import intake_parquet
>>> intake_parquet.__version__, intake.__version__
('0.2.3', '0.6.3')
>>> cat = intake.open_catalog('intake://localhost:5555')
>>> type(cat.big_parquet)
intake.container.dataframe.RemoteDataFrame
>>> pq = intake.open_parquet(cat.big_parquet, columns=['Column 1'])
>>> type(pq) 
intake_parquet.source.ParquetSource
>>> pq.read()
...
TypeError: argument of type 'RemoteDataFrame' is not iterable

If I read the source directly without using intake.open_parquet, it works fine, but then I have no way to limit the columns.

>>> cat.big_parquet.read()

Is this the expected behavior? Apologies in advance if I missed it in the docs.

@jbogaardt

For reference, here is the catalog YAML entry for the same source:

    big_parquet:
        description: A big parquet file
        driver: parquet
        args:
          urlpath: '/mnt/datafiles/big_parquet.parquet'

@martindurant

Whether you get a "remote" dataset (RemoteDataFrame) or a "local" one (ParquetSource) depends on the value of direct_access on the server. If it is "forbid", access is only via the server and the client does not get to make any choices - this is the situation you are hitting. The intent is that the server hides the origin of the data. Use "allow" to let the client open the data directly.
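If I'm reading the catalog layout right, `direct_access` is a per-entry key in the server's catalog YAML (valid values are `forbid`, `allow`, and `force`); a sketch of the entry above with direct access allowed:

```yaml
big_parquet:
    description: A big parquet file
    driver: parquet
    direct_access: allow   # clients may open the file themselves
    args:
      urlpath: '/mnt/datafiles/big_parquet.parquet'
```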

intake.open_parquet(cat.big_parquet, columns=['Column 1'])

No, this is not intended API - the parquet source will attempt to "open" the given variable, which should be one or more paths.

@jbogaardt
Copy link
Author

Thank you Martin, that clears up quite a bit. I like the idea of having the remote server handle the parquet querying. Passing the columns argument to the get method accomplishes that!

>>> import intake
>>> intake.__version__
'0.6.3'
>>> cat = intake.open_catalog('intake://localhost:5555')
>>> type(cat.big_parquet)
intake.container.dataframe.RemoteDataFrame
>>> # Works!
>>> cat.big_parquet.get(columns=['Column 1']).to_dask()

However, when I try the filters argument as defined for dd.read_parquet(), the remote server throws an error. Are filters on a partitioned parquet file not supported through the remote server?

>>> # Doesn't work
>>> cat.big_parquet.get(columns=['Column 1'], filters=[('YEAR', '==', 2020)]).to_dask()

...
/conda/envs/dev/lib/python3.8/site-packages/intake/catalog/remote.py in get(self, **user_parameters)
    448         http_args['headers'] = self.http_args['headers'].copy()
    449         http_args['headers'].update(self.auth.get_headers())
--> 450         return open_remote(
    451             self.url, self.name, container=self.container,
    452             user_parameters=user_parameters, description=self.description,

/conda/envs/dev/lib/python3.8/site-packages/intake/catalog/remote.py in open_remote(url, entry, container, user_parameters, description, http_args, page_size, auth, getenv, getshell)
    507 
    508     else:
--> 509         raise Exception('Server error: %d, %s' % (req.status_code, req.reason))

Exception: Server error: 400, too many values to unpack (expected 3)

@martindurant

Actually, I'm surprised that the columns= kwarg worked at all. The idea of remote sources (with direct access forbidden) is that the client should know nothing about how the source is loaded on the server - not even that it happens to be parquet. If you don't know the exact source type, you cannot know which arguments make sense, so all of this is supposed to be up to the server to decide.

(The traceback shown by the server will give you more detail on exactly where this assumption comes into play.)
