Inefficient use of memory in code: dataframe copies #19

Open
svdhoog opened this issue May 26, 2018 · 3 comments

svdhoog commented May 26, 2018

visualization/main.py, line 189-208:

[*]        d = agent_dframes[param['agent']]  # comment: this can be replaced in line below to save memory, here now just for simplicity

        # check if table columns contain the given variables from config file
        for i, entry in enumerate(var_list):
            if not (entry in list(d)):
                erf("Table has columns {0} and var{1}='{2}' does not match.".format(list(d), i+1, entry))

        # stage-I filtering, all input vars are sliced with desired set & run values
[**]   filtered = d.iloc[(d.index.get_level_values('set').isin(param['set'])) & (d.index.get_level_values('run').isin(param['run'])) & (d.index.get_level_values('major').isin(param['major'])) & (d.index.get_level_values('minor').isin(param['minor']))][var_list].dropna().astype(float)

        df_main = pd.DataFrame()
        index1 = 0
        for dkey, dval in var_dic.items():
            df = filter_by_value(dkey, dval, filtered)  # stage-II filtering for selecting variables according to their values
            if df_main.empty:
                df_main = df
            else:
                df_main = pd.concat([df_main, df], axis=1)
[***]       del df

[*] line 189: This appears to copy the entire data frame into the variable d.
Could this be resolved by substituting the right-hand side of d = directly into the lines below?

[**] line 197: This appears to create another data frame, filtered, that is used only once in the lines below, in line 202.

[***] Here df is deleted, which was the filtered data frame that was copied into df_main. Isn't this inefficient copying of data?


svdhoog commented May 26, 2018

Proposed change could be:

          # comment: d was replaced by the line below to save memory
[*]       # d = agent_dframes[param['agent']]  

        # check if table columns contain the given variables from config file
        for i, entry in enumerate(var_list):
            if not (entry in list(agent_dframes[param['agent']])):
                erf("Table has columns {0} and var{1}='{2}' does not match.".format(list(agent_dframes[param['agent']]), i+1, entry))

        # stage-I filtering, all input vars are sliced with desired set & run values
[**]   filtered = agent_dframes[param['agent']].iloc[(agent_dframes[param['agent']].index.get_level_values('set').isin(param['set'])) & (agent_dframes[param['agent']].index.get_level_values('run').isin(param['run'])) & (agent_dframes[param['agent']].index.get_level_values('major').isin(param['major'])) & (agent_dframes[param['agent']].index.get_level_values('minor').isin(param['minor']))][var_list].dropna().astype(float)

        df_main = pd.DataFrame()
        index1 = 0
        # stage-II filtering for selecting variables according to their values
        for dkey, dval in var_dic.items():
            df = filter_by_value(dkey, dval, filtered)  
            if df_main.empty:
                df_main = df
            else:
                df_main = pd.concat([df_main, df], axis=1)
[***]       del df


svdhoog commented May 26, 2018

2nd case:

visualization/main.py, line 161-163:

        d = pd.DataFrame()  # Main dataframe to hold all the dataframes of each instance (one agenttype)
        df_list = []
                  ... [constructing df_list]
[*]     d = pd.concat(df_list)  # Add each dataframe from panel into a main dataframe containing all sets and runs
[**]    del df_list
[***]   agent_dframes[agentname] = d  # this dict contains agent-type names as keys, and the corresponding dataframes as values

[*] Here df_list is concatenated/added to d
[**] Then it is deleted
[***] Now d gets copied into agent_dframes[agentname]

Can [***] not be made more efficient?

Proposed code change:

[***]   agent_dframes[agentname] = pd.concat(df_list) # like at [*] we concat df_list


svdhoog commented Apr 12, 2021

Assignment in Python does not copy the data frame. It binds the name d to the same object that is stored in agent_dframes, so no data is duplicated in memory here:

d = agent_dframes[param['agent']]
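
To illustrate the point about name binding, here is a minimal sketch. The contents of agent_dframes and param are made-up stand-ins, not the data from visualization/main.py; only the assignment itself mirrors the line above.

```python
import pandas as pd

# Hypothetical stand-ins for the real agent_dframes dict and param config.
agent_dframes = {"Firm": pd.DataFrame({"output": [1.0, 2.0, 3.0]})}
param = {"agent": "Firm"}

# Same form as the line in main.py: this binds a second name, it does not copy.
d = agent_dframes[param["agent"]]

# Both names refer to the very same DataFrame object.
same_object = d is agent_dframes[param["agent"]]
print(same_object)

# A change made through one name is visible through the other.
d.loc[0, "output"] = 10.0
print(agent_dframes[param["agent"]].loc[0, "output"])
```

An explicit copy would only be made by calling something like d.copy(), which the original code does not do.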

The only inefficiency here is that we are creating a new DataFrame df containing the filtered data that then gets concatenated to df_main:

        for dkey, dval in var_dic.items():
            df = filter_by_value(dkey, dval, filtered)  
            if df_main.empty:
                df_main = df
            else:
                df_main = pd.concat([df_main, df], axis=1)
[***]       del df

More efficient implementation, obtained by removing the intermediate DataFrame df:

        for dkey, dval in var_dic.items():
            if df_main.empty:
                df_main = filter_by_value(dkey, dval, filtered)
            else:
                df_main = pd.concat([df_main, filter_by_value(dkey, dval, filtered)], axis=1)
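
A possible further step, not taken from the original code: each pd.concat call builds a new DataFrame from its inputs, so concatenating inside the loop re-copies the accumulated df_main on every iteration. Collecting the pieces in a list and concatenating once avoids that. The sketch below uses a hypothetical filter_by_value and made-up data purely so that it runs on its own; the real function in main.py may behave differently, and the result can differ from the loop above in edge cases where an intermediate result is empty.

```python
import pandas as pd


def filter_by_value(dkey, dval, filtered):
    # Hypothetical stand-in for the real filter_by_value:
    # keep the rows of column dkey whose value exceeds the threshold dval.
    return filtered.loc[filtered[dkey] > dval, [dkey]]


# Made-up inputs for illustration only.
filtered = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [4.0, 5.0, 6.0]})
var_dic = {"a": 1.5, "b": 4.5}

# Collect every filtered piece first, then concatenate a single time.
parts = [filter_by_value(dkey, dval, filtered) for dkey, dval in var_dic.items()]
df_main = pd.concat(parts, axis=1) if parts else pd.DataFrame()

print(df_main)
```

The guard on an empty parts list is there because pd.concat raises a ValueError when given nothing to concatenate.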
