
Daskification and Numba #153

Open
pgierz opened this issue Sep 30, 2021 · 23 comments
Labels
enhancement New feature or request

Comments

@pgierz
Member

pgierz commented Sep 30, 2021

Hello,

This is more for planning and discussion: one of the first things I'd like to do in the exascale analytics project is to prepare PyFesom for "obscenely large data" (henceforth and forever to be abbreviated as OLD, since we are all getting old):

  1. Convert all Numpy things into Dask Arrays

  2. Where possible, apply Numba @jit decorators.

  3. Try to convert as much as possible into Xarray.

Thoughts and suggestions for other changes are welcome 😃

@pgierz pgierz added the enhancement New feature or request label Sep 30, 2021
@koldunovn
Member

Hi @pgierz

Initially my idea was also to use xarray everywhere, but my experience with OLD data is what restrained me from doing so at the time, and this is why I tried to keep the possibility for functions to accept both pure numpy arrays and dask. Some of my problems with early versions of xarray and dask are probably solved by now, and it may no longer be a problem to use xarray in most places. Let's see.

At the end of the day, whether we decide to completely switch to the xarray representation or not, the most important thing for me is that pyfesom2 continues to be a collection of functions that are easy to extract and modify. This is how science is done: you invent new things, so it should be easy for the user to take a function and extract it completely from pyfesom. Admittedly it's already not that easy, since in many cases we rely on the mesh object, but we should not make it more difficult by hiding things in obscure methods of huge classes, for example.

@pgierz
Member Author

pgierz commented Sep 30, 2021

@koldunovn Rather than rewriting everything, we could maintain a backwards compatible version in one of two ways:

A) make "sister functions" with slightly modified names (or a slightly renamed module)

B) check for some environment variable or global setting and then branch between numpy and dask with an if/else check.

What do you think?

@koldunovn
Member

The best way to make a decision is to play around with different variants. I still have to become more comfortable with what @suvarchal did. I hope to spend a good amount of time with pyfesom2 next week.

@pgierz
Member Author

pgierz commented Oct 1, 2021

I started a bit on the "global option" idea, modeling it a little after what the HoloViz people do. In principle, the user would do the following:

import pyfesom2 as pf
pf.dask_extension() 

Or, for tools from the command line:

$ export PYFESOM2_USE_DASK="true"
$ pfinterp .....

If we incorporate this check directly into the function body, it would unfortunately make the code less "portable", in the sense that it clutters things and makes it harder for users to just copy out a section they want to incorporate into their own scripts without all the other pyfesom machinery around it. This is where the first idea comes in: we can write "sister functions" and then, depending on the current setting, call either the normal version or the dask-enhanced version. That would look like this:

def pfinterp(*args, **kwargs):
    if pf.config.use_dask:
        return _pfinterp_dask(*args, **kwargs)
    # ...regular body of pfinterp...

Now, before I start tearing the code apart and putting if/else everywhere, it would be good to first identify which functions or routines would benefit most from this. Also, would this be a maintainable approach, or would it rather be confusing?
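To make the two ideas concrete, here is a minimal, self-contained sketch of how the global option and the sister-function dispatch could fit together. All names here (`_Config`, `dask_extension`, the `_pfinterp_*` helpers) are hypothetical illustrations, not the actual pyfesom2 API:

```python
import os

class _Config:
    """Hypothetical global settings object, modeled on the HoloViz-style extension call."""
    def __init__(self):
        # command-line tools pick up the default from the environment variable
        self.use_dask = os.environ.get("PYFESOM2_USE_DASK", "").lower() == "true"

config = _Config()

def dask_extension():
    """What pf.dask_extension() could do: flip the global switch."""
    config.use_dask = True

def _pfinterp_numpy(*args, **kwargs):
    return "numpy path"   # stand-in for the regular body of pfinterp

def _pfinterp_dask(*args, **kwargs):
    return "dask path"    # stand-in for the dask-enhanced sister function

def pfinterp(*args, **kwargs):
    # a single if/else at the top keeps the numpy body copy-paste friendly
    if config.use_dask:
        return _pfinterp_dask(*args, **kwargs)
    return _pfinterp_numpy(*args, **kwargs)
```

With this shape, the numpy body stays extractable: a user can copy `_pfinterp_numpy` out of the library without touching the dispatch layer.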

@koldunovn
Member

@pgierz The sister functions look like a good way forward to me. I would also make heavy use of the xarray accessor functions introduced by @suvarchal for xarray-type data. So in the end we would have pure numpy and FESOM-flavoured xarray as possible inputs for the functions.

I had hoped to spend this week mostly on pyfesom2 and fdiag, but I desperately failed. I will try to do it next week; so many things need fixing, including the currently terrible installation experience for users...

@suvarchal
Collaborator

suvarchal commented Oct 2, 2021

@pgierz thanks for the issue; among other things, it really helps me to flush out, consolidate, and articulate my evolved views on this.

A few more points in favor of the proposal, probably leaning towards xarrayfication first.

  • Because xarray implements the numpy array protocol by default, we can safely use its DataArrays (which are, by default, backed by numpy objects) in numpy-based functions, e.g. np.sum(xr.DataArray(np.random.random(100))), and in matplotlib for that matter. So I guess we shouldn't be too bothered about being tied to xarray. Since the common entry point to pyfesom2 is usually loading a dataset, we should just start with functions that load a dataset the xarray way (without any chunks arguments for a start); the more involved part is what to do with the mesh object (and how to support both old and new mesh files). With just this, all existing library code (which uses numpy) should work as is, and we can still reap the benefits of lazy, random-access loading of large meshes. Any outlier cases should be interesting, but I think they reflect a much broader issue.

  • I think a significant part of the much-needed daskification and its performance gains can come from a simple xarrayfication strategy: 1. load all data with xarray methods, 2. then, function by function, where possible, replace numpy-specific calls like np.sum(arr) with xarray-specific ones like da.sum(...). Using xarray methods this way automatically enables dask arrays (numpy functions mostly can't operate on dask arrays because it is not obvious how to loop over the additional chunk dimension, and xarray methods handle this for us).

  • A subtle but nevertheless important point, I think, is that using xarray promotes the readability of the library code. As a simple illustration: I find .sum(dim="time") more readable, with clearer intent, than .sum(axis=0); this also reduces the chance of axis-based bugs and makes such bugs easier to spot.

  • Another elephant in the room to address is plotting: the customization and convenience of matplotlib vs. the performance of datashader needed for HR data vs. the interactivity of holoviews/bokeh. To enable all of these we may need multiple functions that are dynamically dispatched based on, say, a backend setting or parameter, and I am sure all of us have many opinions on this.

  • For functions that are not daskified by the simple xarrayfication strategy above, we may need to write an alternate variant that uses numba/dask intrinsics. Such variants of the same function are also needed for the different plotting backends. To keep the same name for all the variants we can use Python's singledispatch, which selects a variant based on the type annotations of the arguments. Interestingly, contrary to the popular belief that Python's type annotations are passive and used only for type checking, single dispatch is an example where the interpreter uses them directly. That is another argument for enforcing type annotations in new additions to the library, apart from promoting readability.

  • xarray accessors are cool: they provide an enhanced user experience and new opportunities when used with datasets of known structure, like FESOM data. Nevertheless, the temptation to make all library functionality part of the accessor should be resisted for the greater good; specifically, we should avoid any heavy analysis in the accessor class. The accessor should be as lean as possible, just enabling the most-used features or enhancing the user experience. Most functionality in the accessors should come from (xarrayfied) functions of the library, to keep the library modular and easily extendable. The current implementation of the accessors was meant as a demo of the capabilities and user experience accessors enable, not as a reference implementation; ideally it should be done at the end.
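The array-protocol point in the first bullet is easy to check directly; a minimal sketch with plain numpy and xarray (no pyfesom2 specifics assumed):

```python
import numpy as np
import xarray as xr

da = xr.DataArray(np.arange(5.0), dims=("nod2",))

# a numpy function applied directly to a DataArray...
total = np.sum(da)

# ...returns an xarray object with the expected value
assert isinstance(total, xr.DataArray)
assert float(total) == 10.0
```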

There is more clutter still left in my head, and I am eager to say more about how to do this incrementally and without breaking existing functionality (I will follow up in subsequent comments), but I would be curious to know whether this makes sense, to hear comments, and whether you are convinced it is worth a try.
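The singledispatch idea from the bullets above can be sketched with the standard library alone; the function name and the registered variants are hypothetical:

```python
from functools import singledispatch

import numpy as np

@singledispatch
def timemean(data):
    # fallback when no variant is registered for the argument's type
    raise TypeError(f"no timemean variant for {type(data).__name__}")

@timemean.register
def _(data: np.ndarray):
    # numpy variant: assumes time is the leading axis
    return data.mean(axis=0)

@timemean.register
def _(data: list):
    # plain-list variant, just to show a second dispatch target
    return sum(data) / len(data)
```

Since Python 3.7, `register` reads the type annotation of the first parameter, which is the sense in which annotations are consumed at runtime here; an xarray- or dask-specific variant would be registered the same way.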

@pgierz
Member Author

pgierz commented Oct 2, 2021

Hello @suvarchal,

Thanks for the comments, they mirror a lot of what I have been thinking. I'll be working on exascale analysis tools full time as of next week, and I thought I'd start with FESOM.

Perhaps it would be good to plan a bit and split up the work, otherwise things will end up being done twice. I had a look yesterday at the accessor class; what I found missing there was a section in the module docstring with some examples. Maybe you could add that?

In the meantime, I can have a look at simply wrapping the load functions with xarray where I can, and I'll keep a list (in a separate issue) of which ones work out of the box, and where we have problematic cases.

As for plotting: I would say that is a second step. I have some ideas: we can maybe leverage the hvplot extension, but let's get everything in xarray and dask first.

@koldunovn
Member

koldunovn commented Oct 2, 2021

Ok, since @pgierz you have some time and steam to do this, let's try to make it happen. From my side I suggest the following:

https://github.com/FESOM/pyfesom2/projects/4

Those are mostly my tasks, but I would like suggestions from your side on datasets, and maybe on the best ways to do things.

I suggest that @pgierz and @suvarchal create a daskification project and try to sum up what would be worth trying first.

From my point of view it is important to conceptually decide how we handle the mesh: should it be part of the xarray dataset, as Suvi implemented, or a separate object (which would make it easy to pass to functions and allow pure data as input)? I would also think about how to read the mesh. Do we still need to rely on ASCII files, or should we just switch to the netCDF fesom.mesh.diag file, as all other models do?

@pgierz
Member Author

pgierz commented Oct 3, 2021

@koldunovn to not clutter this issue, I will open a new one to discuss exactly what to do with the mesh. I would prefer this one to be linked only to dask things not solved by xarray.

@suvarchal, let's keep this one open for discussion. I'll go through tomorrow morning and make a project as Nikolay suggested, and once we have the "grunt work" done, we can use this place to discuss any particular edge cases.

@pgierz
Member Author

pgierz commented Oct 4, 2021

Hello, please see and add tasks to https://github.com/FESOM/pyfesom2/projects/5

@koldunovn
Member

  • A subtle but nevertheless important point, I think, is that using xarray promotes the readability of the library code. As a simple illustration: I find .sum(dim="time") more readable, with clearer intent, than .sum(axis=0); this also reduces the chance of axis-based bugs and makes such bugs easier to spot.

Just to re-emphasise this point: we plan to change the order of dimensions in future versions of FESOM2 (it will be [time, depth, vertices/elements]), so if we don't want to keep two versions of the code, there is no way around using dimension-based operations :)
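The planned reordering makes the difference concrete; a minimal sketch (the dimension names are illustrative):

```python
import numpy as np
import xarray as xr

# the same field in the current and the planned dimension order
a = xr.DataArray([[0.0, 1.0], [2.0, 3.0]], dims=("time", "nod2"))
b = a.transpose("nod2", "time")

# dimension-based reduction gives the same answer for both layouts
assert (a.sum(dim="time") == b.sum(dim="time")).all()

# axis-based reduction silently changes meaning after the reorder
assert not np.array_equal(a.values.sum(axis=0), b.values.sum(axis=0))
```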

@koldunovn
Member

  • For functions that are not daskified by the simple xarrayfication strategy above, we may need to write an alternate variant that uses numba/dask intrinsics. Such variants of the same function are also needed for the different plotting backends. To keep the same name for all the variants we can use Python's singledispatch, which selects a variant based on the type annotations of the arguments. Interestingly, contrary to the popular belief that Python's type annotations are passive and used only for type checking, single dispatch is an example where the interpreter uses them directly. That is another argument for enforcing type annotations in new additions to the library, apart from promoting readability.

I would just have two different functions. Unification is great, but I prefer to get the expected behaviour from a function instead of realising after an hour of debugging that I forgot to switch backends. But type annotations are really a good thing, so I would try to use them for future code/refactoring.

@koldunovn
Member

  • xarray accessors are cool: they provide an enhanced user experience and new opportunities when used with datasets of known structure, like FESOM data. Nevertheless, the temptation to make all library functionality part of the accessor should be resisted for the greater good; specifically, we should avoid any heavy analysis in the accessor class. The accessor should be as lean as possible, just enabling the most-used features or enhancing the user experience. Most functionality in the accessors should come from (xarrayfied) functions of the library, to keep the library modular and easily extendable. The current implementation of the accessors was meant as a demo of the capabilities and user experience accessors enable, not as a reference implementation; ideally it should be done at the end.

So the idea would be to first write the functions that take xarray data (and the mesh, in one way or another) as input, and then assemble (some of) them in the accessor?

@pgierz
Member Author

pgierz commented Oct 4, 2021

To follow up on Nikolay's last comment: I would need an example of this to fully understand it. I have read about the accessors online, but still haven't fully grasped how this would work on either the programmer end or the user end. If we want to have a spontaneous meeting, I can turn on my webex room after lunch, but let me know first.
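For a feel of both ends, here is a minimal accessor sketch; the accessor name "fesom" and the function names are illustrative, not the actual pyfesom2 implementation:

```python
import numpy as np
import xarray as xr

def timemean(da):
    """Plain library function: xarray in, xarray out, easy to copy out of the library."""
    return da.mean(dim="time")

@xr.register_dataset_accessor("fesom")
class FesomAccessor:
    """Programmer end: a lean class that only delegates to library functions."""
    def __init__(self, ds):
        self._ds = ds

    def timemean(self, var):
        return timemean(self._ds[var])

# user end: the accessor appears as a .fesom namespace on any Dataset
ds = xr.Dataset({"temp": (("time", "nod2"), np.arange(6.0).reshape(3, 2))})
result = ds.fesom.timemean("temp")
```

This matches the "lean accessor" point above: the heavy lifting lives in the free function, and the accessor just wires it to the dataset.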

@suvarchal
Collaborator

suvarchal commented Oct 4, 2021

To follow up on Nikolay's last comment: I would need an example of this to fully understand it. I have read about the accessors online, but still haven't fully grasped how this would work on either the programmer end or the user end. If we want to have a spontaneous meeting, I can turn on my webex room after lunch, but let me know first.

Can we push the meeting a couple of days later, like Wednesday or after? I am a bit occupied with handling a sick kid in the meantime.

@pgierz
Member Author

pgierz commented Oct 4, 2021

Yes, it was just an idea to get us all on the same page. Thursday afternoon or anytime Friday works for me.

@suvarchal
Collaborator

  • A subtle but nevertheless important point, I think, is that using xarray promotes the readability of the library code. As a simple illustration: I find .sum(dim="time") more readable, with clearer intent, than .sum(axis=0); this also reduces the chance of axis-based bugs and makes such bugs easier to spot.

Just to re-emphasise this point: we plan to change the order of dimensions in future versions of FESOM2 (it will be [time, depth, vertices/elements]), so if we don't want to keep two versions of the code, there is no way around using dimension-based operations :)

Not sure I get this, but do you mean... there is no way around not using dimension-based operations? That would fit my understanding.

@suvarchal
Collaborator

Yes, it was just an idea to get us all on the same page. Thursday afternoon or anytime Friday works for me.

Both of these times currently work for me.

@koldunovn
Member

Not sure I get this, but do you mean... there is no way around not using dimension-based operations? That would fit my understanding.

Well, you can still do it without, but it will require many more checks :)

@suvarchal
Collaborator

There are two example notebooks that use accessors: https://github.com/FESOM/pyfesom2/blob/master/notebooks/accessor_selections.ipynb https://github.com/FESOM/pyfesom2/blob/master/notebooks/remote_datasets.ipynb

One more relevant one is still in a branch (I wasn't completely happy with it, so I didn't make a full-fledged PR; I should do it soon 😄).

@suvarchal
Collaborator

Not sure I get this, but do you mean... there is no way around not using dimension-based operations? That would fit my understanding.

Well, you can still do it without, but it will require many more checks :)

Now I am not convinced we mean the same thing 😄; I guess I was confused by the double negative in the sentence 😄

@suvarchal
Collaborator

A few thoughts on a strategy to approach daskification, all subject to discussion.

  1. Preparation
    • Prepare example scripts using the current code base that cover the breadth of pyfesom2 we wish to target. These can serve as integration tests and, more importantly, track progress in performance gains, which can be very motivating. These examples may already be present in the example notebooks; in that case they probably need to be adapted to use the meshes and data from Large meshes for testing #154.
    • Revamp CI: fix any current issues, modernize, and at least include coverage (optionally doctest, mypy, and linting as a pre-commit hook, all subject to discussion first).
  2. Daskify data input routines
    • data loading
    • mesh loading
  3. Daskify analysis functions
  4. Modernize plotting (but that may be another issue)
  5. Any code restructuring, including refactoring the accessors using the newly daskified functions.

I will elaborate more in subsequent edits of this comment; right now it is a placeholder :)
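Step 2 of the list above ("daskify data input routines") can be sketched without any files; an in-memory dataset stands in for FESOM output, and the variable/dimension names are illustrative. With real files the same effect comes from passing chunks= to xr.open_dataset:

```python
import numpy as np
import xarray as xr

# stand-in for a loaded FESOM field: 12 time steps on 10 surface nodes
ds = xr.Dataset({"temp": (("time", "nod2"), np.random.rand(12, 10))})

# chunking turns the numpy-backed variable into a dask-backed one
lazy = ds.chunk({"time": 4})
assert lazy["temp"].chunks is not None

# reductions now only build a task graph; compute() triggers execution
mean_per_step = lazy["temp"].mean(dim="nod2")
values = mean_per_step.compute()
assert values.shape == (12,)
```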
