
open_parquet + remote server #25

Open
jbogaardt opened this issue Sep 16, 2021 · 4 comments

@jbogaardt

Hi,

This may be a usage question rather than a bug, and I apologize if this is not the right place to ask. I am struggling to limit the columns read from a parquet file served by an intake server. Based on the docs and the example notebook, I thought this would work:

>>> import intake
>>> import intake_parquet
>>> intake_parquet.__version__, intake.__version__
('0.2.3', '0.6.3')
>>> cat = intake.open_catalog('intake://localhost:5555')
>>> type(cat.big_parquet)
intake.container.dataframe.RemoteDataFrame
>>> pq = intake.open_parquet(cat.big_parquet, columns=['Column 1'])
>>> type(pq) 
intake_parquet.source.ParquetSource
>>> pq.read()
...
TypeError: argument of type 'RemoteDataFrame' is not iterable

If I read the source directly without using intake.open_parquet, it works fine, but then I have no way to limit the columns.

>>> cat.big_parquet.read()

Is this the expected behavior? Apologies in advance if I missed it in the docs.

@jbogaardt

For reference, here is the catalog YAML entry for the same source:

    big_parquet:
        description: A big parquet file
        driver: parquet
        args:
          urlpath: '/mnt/datafiles/big_parquet.parquet'

@martindurant

Whether you get a "remote" dataset (RemoteDataFrame) or a "local" one (ParquetSource) depends on the value of direct_access on the server. If it is "forbid", access is only via the server and the client does not get to make any choices - this is the situation you are hitting. The intent is that the server hides the origin of the data. Use "allow" to let the client open the data directly.
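If I'm reading the catalog layout right, `direct_access` is a per-entry key in the server's catalog YAML (valid values are `forbid`, `allow`, and `force`); a sketch of the entry above with direct access allowed:

```yaml
big_parquet:
    description: A big parquet file
    driver: parquet
    direct_access: allow   # clients may open the file themselves
    args:
      urlpath: '/mnt/datafiles/big_parquet.parquet'
```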

intake.open_parquet(cat.big_parquet, columns=['Column 1'])

No, this is not intended API - the parquet source will attempt to "open" the given variable, which should be one or more paths.

@jbogaardt
Copy link
Author

Thank you Martin, that clears up quite a bit. I like the idea of having the remote server handle the parquet querying. Passing the columns argument to the get method accomplishes that!

>>> import intake
>>> intake.__version__
'0.6.3'
>>> cat = intake.open_catalog('intake://localhost:5555')
>>> type(cat.big_parquet)
intake.container.dataframe.RemoteDataFrame
>>> # Works!
>>> cat.big_parquet.get(columns=['Column 1']).to_dask()

However, when I try the filters argument as defined for dd.read_parquet(), the remote server throws an error. Are filters on a partitioned parquet file not supported through the remote server?

>>> # Doesn't work
>>> cat.big_parquet.get(columns=['Column 1'], filters=[('YEAR', '==', 2020)]).to_dask()

...
/conda/envs/dev/lib/python3.8/site-packages/intake/catalog/remote.py in get(self, **user_parameters)
    448         http_args['headers'] = self.http_args['headers'].copy()
    449         http_args['headers'].update(self.auth.get_headers())
--> 450         return open_remote(
    451             self.url, self.name, container=self.container,
    452             user_parameters=user_parameters, description=self.description,

/conda/envs/dev/lib/python3.8/site-packages/intake/catalog/remote.py in open_remote(url, entry, container, user_parameters, description, http_args, page_size, auth, getenv, getshell)
    507 
    508     else:
--> 509         raise Exception('Server error: %d, %s' % (req.status_code, req.reason))

Exception: Server error: 400, too many values to unpack (expected 3)

@martindurant

Actually, I'm surprised that the columns= kwarg worked at all. The idea of remote sources (with direct access forbidden) is that the client should know nothing about how the source is loaded on the server - not even that it happens to be parquet. If you don't know the exact source type, you cannot know which arguments make sense, so all of this is supposed to be up to the server to decide.

(The traceback shown by the server will give you more detail on exactly where this assumption comes into play.)
