
Setting chunks auto in open_mfdataset #95

Merged: 9 commits into main from fix_94 on Feb 2, 2024
Conversation

@SarahAlidoost (Member) commented Feb 2, 2024

Closes #94

In this PR:

  • Set chunks to "auto" in xr.open_mfdataset to avoid memory issues: by default, chunks are chosen so that each entire input file is loaded into memory at once. See the xarray docs. (All three changes are sketched after this list.)
  • In the timestep "1800S", S is replaced with s to fix the pandas FutureWarning: 'S' is deprecated and will be removed in a future version, please use 's' instead. The lowercase s also works for pandas < 2; see the pandas source code.
  • Set dask.config.set({"array.slicing.split_large_chunks": True}) to avoid creating large chunks, which trigger PerformanceWarning: Slicing is producing a large chunk. See the dask docs.
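
A minimal sketch of these three changes combined; the file pattern, date range, and variable names are illustrative placeholders, not the repository's actual code:

    import dask
    import pandas as pd
    import xarray as xr

    # Split oversized chunks produced by slicing rather than keeping them
    # whole, to avoid "PerformanceWarning: Slicing is producing a large chunk".
    dask.config.set({"array.slicing.split_large_chunks": True})

    # chunks="auto" lets dask choose chunk sizes; by default each entire
    # input file would be loaded into memory as a single chunk.
    ds = xr.open_mfdataset("forcing_*.nc", chunks="auto")  # placeholder pattern

    # Lowercase "s" for seconds: "1800S" triggers a FutureWarning in recent
    # pandas, while "1800s" also works on pandas < 2.
    time_index = pd.date_range("2020-01-01", periods=48, freq="1800s")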

There is still another warning, PerformanceWarning: Increasing number of chunks by factor. This is due to internal re-chunking and might be resolved by zampy; see the dask source code.

@SarahAlidoost changed the title from "Fix memory problems" to "Setting chunks auto in open_mfdataset" Feb 2, 2024
@BSchilperoort (Contributor) left a comment

I'm glad you were able to find a way to fix this!

I have also found that open_mfdataset can be quite slow. In cases where you have big datasets, and know well how to concatenate/merge the data, opening the files separately and then defining the merging operations manually can lead to better performance.

The code here is fine as is, it'll be mostly replaced anyway once we move to Zampy's output.
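
As a rough illustration of that manual approach (the file names and the "time" dimension here are assumptions, not code from this PR):

    import xarray as xr

    # Open each file lazily with its own chunking, then state the combine
    # operation explicitly instead of letting open_mfdataset infer it.
    files = ["part_2019.nc", "part_2020.nc"]  # placeholder file names
    datasets = [xr.open_dataset(f, chunks="auto") for f in files]
    ds = xr.concat(datasets, dim="time")  # assumes files are split along time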

@SarahAlidoost marked this pull request as ready for review February 2, 2024 15:14
@SarahAlidoost (Member, Author) commented

> I'm glad you were able to find a way to fix this!
>
> I have also found that open_mfdataset can be quite slow. In cases where you have big datasets, and know well how to concatenate/merge the data, opening the files separately and then defining the merging operations manually can lead to better performance.
>
> The code here is fine as is, it'll be mostly replaced anyway once we move to Zampy's output.

Thanks. I added other changes (see here); can you have another look?

@BSchilperoort (Contributor) left a comment

Hi Sarah, I just have one comment on how you set the dask config. Once that is resolved feel free to merge 👍

Review comment on PyStemmusScope/global_data/cci_landcover.py (outdated, resolved)
sonarqubecloud bot commented Feb 2, 2024

@SarahAlidoost merged commit e97a9f6 into main Feb 2, 2024
16 checks passed
@SarahAlidoost deleted the fix_94 branch February 2, 2024 16:23
Development

Successfully merging this pull request may close these issues.

Model.setup() with global data fails due to high memory usage on own system and CRIB