Setting chunks auto in open_mfdataset #95
I'm glad you were able to find a way to fix this!
I have also found that `open_mfdataset` can be quite slow. In cases where you have big datasets and know well how to concatenate/merge the data, opening the files separately and then defining the merging operations manually can lead to better performance.
The code here is fine as is, it'll be mostly replaced anyway once we move to Zampy's output.
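For reference, the manual approach mentioned above could look roughly like the sketch below. The function names `open_separately` and `concat_along_time` are hypothetical, and the sketch assumes the files differ only along a `time` dimension; it is not the code from this PR.

```python
import numpy as np
import xarray as xr


def open_separately(paths, chunks="auto"):
    """Open each file lazily on its own instead of via xr.open_mfdataset."""
    return [xr.open_dataset(path, chunks=chunks) for path in paths]


def concat_along_time(datasets, dim="time"):
    """Concatenate datasets that are assumed to differ only along `dim`."""
    return xr.concat(
        datasets,
        dim=dim,
        data_vars="minimal",  # only concatenate variables that contain `dim`
        coords="minimal",     # same rule for coordinates
        compat="override",    # skip equality checks, take values from the first dataset
    )
```

Usage would be something like `concat_along_time(open_separately(sorted(paths)))`. Skipping the equality checks is only safe when the non-time variables really are identical across files.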
Thanks. I added other changes (see here); can you have another look?
Hi Sarah, I just have one comment on how you set the dask config. Once that is resolved feel free to merge 👍
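The thread does not record which change to the dask config was requested. As background only: `dask.config.set` can be used either as a global call or as a context manager that restores the previous value on exit. The sketch below shows the scoped form; the helper name `take_split` is hypothetical.

```python
import dask
import dask.array as da
import numpy as np

KEY = "array.slicing.split_large_chunks"


def take_split(arr, indices):
    """Index a dask array with the option enabled only for this operation."""
    # The setting is read while the slicing graph is built, so the
    # indexing itself has to happen inside the `with` block.
    with dask.config.set({KEY: True}):
        return arr[indices]
```

Outside the `with` block the option falls back to whatever value it had before.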
Quality Gate passed. Kudos, no new issues were introduced! 0 New issues
close #94
In this PR:
- Set `chunks` to `"auto"` to avoid memory issues in `xr.open_mfdataset`, because, by default, chunks will be chosen to load entire input files into memory at once. See doc.
- `S` is replaced with `s` to fix `pandas: FutureWarning: 'S' is deprecated and will be removed in a future version, please use 's' instead.` This also works for pandas < 2, see source code.
- Set `dask.config.set({"array.slicing.split_large_chunks": True})` to avoid creating the large chunk, because of `PerformanceWarning: Slicing is producing a large chunk`, see doc.

There is still another `PerformanceWarning: Increasing number of chunks by factor`. This is due to internal re-chunking and might be solved by zampy. See dask source code.
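A minimal sketch of the first two changes, for illustration. The function names `open_forcing` and `round_to_seconds` are hypothetical, and rounding is only an assumed example of where the frequency alias might be used; the actual call sites are in the diff.

```python
import pandas as pd
import xarray as xr


def open_forcing(pattern):
    """Open many files lazily and let dask pick the chunk sizes."""
    return xr.open_mfdataset(pattern, chunks="auto")


def round_to_seconds(times):
    """Round timestamps to whole seconds using the lowercase 's' alias."""
    # Uppercase "S" triggers the FutureWarning quoted above.
    return pd.DatetimeIndex(times).round("s")
```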