
Setting chunks auto in open_mfdataset #95

Merged: 9 commits into main from fix_94 on Feb 2, 2024
Conversation

@SarahAlidoost (Member) commented Feb 2, 2024

Closes #94

In this PR:

  • Set chunks to "auto" in xr.open_mfdataset to avoid memory issues: by default, chunks are chosen so that each entire input file is loaded into memory at once. See the xarray docs. (All three changes are sketched after this list.)
  • In the timestep "1800S", S is replaced with s to fix the pandas FutureWarning: 'S' is deprecated and will be removed in a future version, please use 's' instead. The lowercase s also works for pandas < 2; see the pandas source code.
  • Set dask.config.set({"array.slicing.split_large_chunks": True}) to avoid creating large chunks, which trigger PerformanceWarning: Slicing is producing a large chunk. See the dask docs.
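
A minimal sketch of these three changes combined; the file pattern, date range, and variable names are illustrative placeholders, not the repository's actual code:

    import dask
    import pandas as pd
    import xarray as xr

    # Split oversized chunks produced by slicing rather than keeping them
    # whole, to avoid "PerformanceWarning: Slicing is producing a large chunk".
    dask.config.set({"array.slicing.split_large_chunks": True})

    # chunks="auto" lets dask choose chunk sizes; by default each entire
    # input file would be loaded into memory as a single chunk.
    ds = xr.open_mfdataset("forcing_*.nc", chunks="auto")  # placeholder pattern

    # Lowercase "s" for seconds: "1800S" triggers a FutureWarning in recent
    # pandas, while "1800s" also works on pandas < 2.
    time_index = pd.date_range("2020-01-01", periods=48, freq="1800s")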

There is still another warning, PerformanceWarning: Increasing number of chunks by factor. This is due to internal re-chunking and might be resolved by zampy; see the dask source code.

@SarahAlidoost changed the title from "Fix memory problems" to "Setting chunks auto in open_mfdataset" Feb 2, 2024
@BSchilperoort (Contributor) left a comment

I'm glad you were able to find a way to fix this!

I have also found that open_mfdataset can be quite slow. In cases where you have big datasets, and know well how to concatenate/merge the data, opening the files separately and then defining the merging operations manually can lead to better performance.

The code here is fine as is, it'll be mostly replaced anyway once we move to Zampy's output.
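
As a rough illustration of that manual approach (the file names and the "time" dimension here are assumptions, not code from this PR):

    import xarray as xr

    # Open each file lazily with its own chunking, then state the combine
    # operation explicitly instead of letting open_mfdataset infer it.
    files = ["part_2019.nc", "part_2020.nc"]  # placeholder file names
    datasets = [xr.open_dataset(f, chunks="auto") for f in files]
    ds = xr.concat(datasets, dim="time")  # assumes files are split along time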

@SarahAlidoost marked this pull request as ready for review February 2, 2024 15:14
@SarahAlidoost (Member, Author) commented

> I'm glad you were able to find a way to fix this!
>
> I have also found that open_mfdataset can be quite slow. In cases where you have big datasets, and know well how to concatenate/merge the data, opening the files separately and then defining the merging operations manually can lead to better performance.
>
> The code here is fine as is, it'll be mostly replaced anyway once we move to Zampy's output.

Thanks. I added other changes (see here); can you have another look?

@BSchilperoort (Contributor) left a comment

Hi Sarah, I just have one comment on how you set the dask config. Once that is resolved feel free to merge 👍

Review comment on PyStemmusScope/global_data/cci_landcover.py (outdated, resolved)
sonarqubecloud bot commented Feb 2, 2024

@SarahAlidoost merged commit e97a9f6 into main Feb 2, 2024
16 checks passed
@SarahAlidoost deleted the fix_94 branch February 2, 2024 16:23
Development

Successfully merging this pull request may close these issues.

Model.setup() with global data fails due to high memory usage on own system and CRIB