Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

rewrite MultiZarrToZarr to always use xr.concat #33

Merged
merged 1 commit into from
Jun 28, 2021

Conversation

martindurant
Copy link
Member

@martindurant martindurant commented Jun 24, 2021

@ lsterzinger @rsignell-usgs - this applies what we discussed, to make to xr.concat explicit and pass a separate set of options to it. If this feels OK, I'll update the docstrings then.
example_ensamble() runs fine with this, quite fast (with the "minimal" and "override" arguments)

@rsignell-usgs
Copy link
Collaborator

Great news @martindurant! Which datasets did you try so far? (I'll try the others)

@martindurant
Copy link
Member Author

Only the noaa-nwm-retro-v2.0-pds 10-file example, which is "example_multi" in the hdf module.

@lsterzinger
Copy link
Collaborator

Can confirm that it works on my 1-day (144 files) of GOES netcdf data in Azure Blob storage. It took the processing time down from 55 minutes to 45 minutes, but only 25 of those minutes were CPU times so I'm not sure what's taking up the rest.

@martindurant
Copy link
Member Author

martindurant commented Jun 26, 2021 via email

@lsterzinger
Copy link
Collaborator

That's the time it takes to create a single reference with MultiZarrToZarr.translate() from 144 existing reference jsons, I'm using

mzz = MultiZarrToZarr(
    json_list,
#     "zip://jsons/*.json::combined.zip",
    remote_protocol='az',
    remote_options={
       'account_name' : 'goeseuwest'
    },    
    xarray_open_kwargs={
        'decode_cf' : False,
        'mask_and_scale' : False,
        'decode_times' : False,
        'use_cftime' : False,
        'decode_coords' : False,
    },
    xarray_concat_args={
        "data_vars": "minimal",
        "coords": "minimal",
        "compat": "override",
        "join": "override",
        "combine_attrs": "override",
        "dim": "t"

    }
)

mzz.translate('combined.json')

@lsterzinger
Copy link
Collaborator

I'm thinking this might be something to do with the azure support in fsspec. #31 and fsspec/filesystem_spec#681 did help my access time, but opening a filesystem from a single reference JSON pointing at 144 netcdf files (combined.json, created using the code above) takes 17min using

fs =  fsspec.filesystem('reference', fo='combined.json', 
                      remote_protocol='az', remote_options={'account_name':'goeseuwest'})

Is that expected for a dataset of this size? Each file is ~300 mb.

@martindurant
Copy link
Member Author

Just opening the file only needs reading the JSON file. If you run a profile/snakeviz, you'll probably find that rendering the templates is taking a long time - what counts is the number of chunks, not the size of those chunks. Probably you can pass simple_templates=True to significantly improve the time. Also, do you have ujson installed?

@martindurant
Copy link
Member Author

To the earlier question of why it takes so long to do the combination - you should turn on logging in adlfs or referenceFileSystem, to see what remote objects are being fetched. With the "minimal" and "override" options, I would have thought there are not too many, but this might be mistaken. The fact that CPU time is only a fraction of the total time suggests that there is significant contribution from latency, waiting for the remote server.

Also note that the opening of the individual filesystems and datasets as input to the combine class could be done in parallel using dask, much as the parallel= option does in xarray.open_mfdataset.

I suggest in any case, that we merge this, so that we can iterate and move on with things such as the docs.

@rsignell-usgs
Copy link
Collaborator

@martindurant and @lsterzinger , yes, let's merge this and keep on testing.

@martindurant martindurant merged commit 67ccf71 into fsspec:main Jun 28, 2021
@martindurant martindurant deleted the combine_api branch June 28, 2021 15:40
@martindurant
Copy link
Member Author

One optimisation I absultely should have mentioned is inline_threshold in SingleHdf5ToZarr. It's hard to know what the best value here is, it's a trade-off between small reads and inflating the output JSON, but I've typically been using 100-500bytes.

Finally, I have the comment in #25 that might be relevant: we make no attempt to join fetches with overlapping/contiguous/near-contiguous ranges. This would matter is several small pieces were being loaded from each of the files.

@lsterzinger
Copy link
Collaborator

@martindurant specifying simple_templates=True in fsspec.filesystem() didn't seem to do much.

How do I turn on logging for adlfs? I'm specifying remote_protocol='az' instead of 'abfs', is there a difference between the two? It seems like they're both using adlfs.

I also ran fsspec.filesystem() with a profiler, but I'm 100% confident in my ability to interpret the results. I've attached the profile file here, I was able to visualize it with snakeviz ./profile

profile.zip

@martindurant
Copy link
Member Author

"az" and "abfs" are the same

@martindurant
Copy link
Member Author

ALL of the time is spent in rendering templates with jinja2. Do use simple_templates=True, this should avoid jinja entirely, I don't know why it's not making a difference for you.
https://github.com/intake/filesystem_spec/blob/master/fsspec/implementations/reference.py#L229

@lsterzinger
Copy link
Collaborator

lsterzinger commented Jun 28, 2021 via email

@lsterzinger
Copy link
Collaborator

@martindurant looks like it's calling jinja2 from _process_references1() where jinja is used to render regardless

https://github.com/intake/filesystem_spec/blob/1e5263a3f38af4ba64a5dbaf0414707ed937826d/fsspec/implementations/reference.py#L286

@lsterzinger
Copy link
Collaborator

At least the call stack says it's called from that function, but it seems like it would only go to jinja if there's '{{' in u

@martindurant
Copy link
Member Author

Hah! It looks like the following line should simply not be there, it takes time and has no effect. Doubtless this is a bad merge.

@lsterzinger
Copy link
Collaborator

Okay yeah that makes sense. Should I open a PR to remove it?

@lsterzinger
Copy link
Collaborator

Removing that line brought the time down from ~20 minutes to just 16 seconds, I think you found the culprit!

@martindurant
Copy link
Member Author

martindurant commented Jun 28, 2021 via email

@martindurant
Copy link
Member Author

I can think of an even faster way to do it, if the 16s seems too long - if indeed most of the time is still spent in string substitution.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants