Memory leaks in gbt.load() #55
Howdy. @sondreus had reached out to me, as we have had a few earlier DMs about his most excellent model and its use of Rcpp. The package is in good standing on CRAN, and Prof Ripley keeps us all honest by running checks with the (very powerful) valgrind. And I can replicate that with a quick valgrind run here: it comes back clean. The fact that @sondreus sees increased memory usage may mean that R assigns new objects it keeps track of. It is tricky to disambiguate that. I'll think some more, but at a glance the package seems not to be doing anything wrong. Happy to help some more and chat, but no smoking gun as far as I can tell.

PS I always find it instructive to have a baseline. So putting, say, these two lines in a file and running them as above with R -d valgrind

Rcpp::cppFunction("double leakMe() { double *p = new double[1000]; return p[0]; }")
leakMe()

yields a valgrind report flagging the (intentionally) leaked allocation. We have none of that in agtboost.
Thanks @eddelbuettel! Looking forward to hearing from @Blunde1, who might look into this too. My guess is that my language is imprecise, and that there is no memory leak in the proper sense; it is just that, when run many times, gbt.load takes up ever more memory (I had a colleague replicate this to be sure), even though the loaded objects are no longer usable by any process from within R. All the best,
Yes. R uses copy-on-write internally, so a changed object always becomes a new object (a plain assignment alone does not copy; the copy happens on modification). I had massive headaches with things like this in the past (though maybe mostly in the distant past, on Windows, with smaller-memory systems), so I don't have a great set of recommendations. A simple gc() is worth a try. In theory, "maybe" we could bring the loading of the model to the C++ side (where we have finer control) and use accessors for the data. In practice that may be harder because ... these model structures are larger. But I guess they are regular? We could possibly try alternate serialization from the C++ side too. I'd be game for helping a little, but I too would love to hear from @Blunde1.
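For anyone who wants to watch copy-on-write happen, base R's tracemem() reports exactly when a copy is made; a minimal illustration:

x <- c(1, 2, 3)
tracemem(x)        # prints x's address and starts reporting copies of it
y <- x             # plain assignment: x and y still share one allocation
y[1] <- 9          # first modification triggers the copy; tracemem reports it
untracemem(x)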
As the cliché goes, I just had some more thoughts about this while on the morning run (and in the shower, cliché of clichés). To clarify: @sondreus, you have this issue when bootstrapping / looping over N model samples and fits, correct? I may have an idea or two (which do not require @Blunde1 to change anything), so maybe you and I can (and should) take this to a scrap repo to noodle around?
@eddelbuettel Yes indeed. I've "solved" the issue by dividing this portion of the update process into two batches, moving the offending portion to a separate script, and running it in a new R process which closes upon completion (so that the rest of the update process won't run out of memory).
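A sketch of that workaround (script and file names here are hypothetical); the point is that a child process returns all of its memory to the OS when it exits:

# offending_step.R holds the gbt.load()-heavy portion and saves its
# results to disk; run it in a throwaway R process:
system2("Rscript", "offending_step.R")
results <- readRDS("offending_results.rds")   # hand the results back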
That is pretty much what I was about to suggest. I happen to use external scripts a lot and really like batching for R -- it is why I wrote littler, to give us r for exactly this kind of scripting.
@sondreus Thanks, I am looking into this. I observe the same as you, also when running it locally. @eddelbuettel thanks for the checks and for running valgrind.
I believe the (de-)serialization can be improved upon, and would love to hear your thoughts on it. The models are regular in the sense that they are deterministic and in essence consist of nodes that are easily traversable (a directed graph), thus the same logic may be applied throughout. Do you know of any code/packages from which inspiration could/should be drawn? In particular, are there any "using accessors for the data" examples you could share?
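For concreteness, the "accessors for the data" idea in the Rcpp Modules sense could look like the sketch below; the class and method names are illustrative, not agtboost's API. The object's state lives on the C++ side and R only reaches it through getter/setter methods:

Rcpp::sourceCpp(code = '
#include <Rcpp.h>
// state lives on the C++ side; R only reaches it through accessor methods
class ModelState {
public:
    void set_values(Rcpp::NumericVector v) { values_ = v; }
    Rcpp::NumericVector get_values() const { return values_; }
private:
    Rcpp::NumericVector values_;
};
RCPP_MODULE(model_state) {
    Rcpp::class_<ModelState>("ModelState")
        .constructor()
        .method("set_values", &ModelState::set_values)
        .method("get_values", &ModelState::get_values);
}')
state <- new(ModelState)
state$set_values(c(1.5, 2.5, 3.5))
state$get_values()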
Yes, it's a weird issue, and it would be good to get a handle on it. We may still be doing something wrong, but apparently it is not something obvious.

Serialization is a fantastic (and complicated!) topic. Per se you are doing nothing wrong, but of course writing "in text" is typically seen as the most expensive way (formatting on the way out, parsing on the way in). It is also human-readable and portable, so there is that on the plus side. If you are only serializing "for yourself" here (i.e. agtboost writes and reads) then saveRDS / readRDS will be faster, at the expense of human readability (and portability). I personally like fst an awful lot: it is very fast and backed by a library one could use elsewhere (directly from C++, to interface Python, ...). And then there are msgpack and feather and writing to a database (SQLite is good for this too) and more. You can go on and on. But you don't have to. Simple is good too.

For now, I think, we need to understand why/how the memory use grows. Maybe try another baseline where we just repeatedly assign a matrix in which we change one or two values (to trigger R's copy-on-write)? I would think that by assigning a new one in loop 'i' it would free / recover the memory from iteration 'i-1'. Why this does not happen here is odd. (And I think I am more puzzled about R here than about agtboost, but we should double-check.)
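A minimal version of that baseline might be (dimensions arbitrary):

m <- matrix(rnorm(1e6), ncol = 100)   # roughly 8 MB of doubles
for (i in 1:50) {
    m2 <- m
    m2[1, 1] <- i                     # forces a full copy-on-write duplicate
}
# watch the R process RSS externally (top / Activity Monitor) while this runs;
# if R reclaims iteration i-1's copy, usage should stay flat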
@eddelbuettel Thanks! I set up a small test: train and save a model once, and then use the load call repeatedly.
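A plausible shape for such a test (the data, dimensions, file name, and gbt.train call are assumptions, not the original snippet):

library(agtboost)
n <- 1000
x <- as.matrix(runif(n, 0, 4))
y <- rnorm(n, x, 1)
mod <- gbt.train(y, x)
gbt.save(mod, "mod.txt")
# the "last two lines": rerun these repeatedly while watching process memory
mod2 <- gbt.load("mod.txt")
gc()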
From running the last two lines incrementally I seem to experience the same memory issue. This is seen from visual inspection in RStudio, and from inspecting the rsession process in the Activity Monitor (macOS). I will see if this is replicated on other systems. Please tell me if I am doing something wrong here; it would also be interesting to know whether you observe this behaviour. If you do indeed observe it, I would assume it is not a problem with agtboost but rather some weird (because I do not understand it 😄) behaviour on the language-memory-handling side, which seems to be aligned with your comments above.
That's exactly what's needed, and maybe also a possible fix I had not thought about. The long story is that some time many years ago (but after I had started using R) the Core team changed the string representation and made it much better. (It was fairly awful before, and one of the reasons for the old bias of using matrices over data.frames, or anything indexed by rownames, as those used to be expensive.) Strings got a better hashed representation, but that representation is still different from numbers (int, double), which map directly onto C types. Moreover, strings are not shared between C++'s std::string and R's internal representation, whereas we do share int or double vectors! So I think when we invoke the loader over and over, new string objects get allocated each time. This may be different if we used numeric types throughout.

But it gets us back to @sondreus' issue, because your serialization is ... vectors of strings, right? We might avoid it all if we went to storing what are really just numbers as numbers. That may be more work, and more fragile, so we should make sure it is worth it. But great existence proof above. This may lead us down a road to improve the issue.
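The representational gap described here is visible from R itself (exact sizes vary by platform):

n <- 1e6
object.size(seq_len(n) + 0)             # doubles: one contiguous ~8 MB block
object.size(as.character(seq_len(n)))   # strings: a separate CHARSXP per element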
@eddelbuettel thanks, the long story and its insights are appreciated! I think there is no way I would have understood this on my own. I changed the attribute in a quick experiment, and it seems consistent with the string hypothesis. For agtboost, a model object holds the loss function as a std::string attribute; I will look into representing it differently (e.g. as an enum).
Really nice work @Blunde1! I think the "how strings are represented" part is a bit of an insider story I happen to have picked up on. I'd be up for continuing to discuss and prototype (and help, once this vacation is over ...) a "more efficient" serialization approach. This is essentially undocumented stuff, so if we can put up a better reference implementation it may help others.
Sorry for the silence, but I had some time off. I just set up a very minimal check to see if our hypothesis concerning string representation holds. I made these two minimal changes to save and load as RDS if the filename ends in rds (very simplistic ...):

modified R-package/R/gbt.load.R
@@ -23,7 +23,12 @@ gbt.load <- function(file)
 {
     # check valid file path
-    mod <- new(ENSEMBLE)
-    mod$load_model(file)
-    return(mod)
-}
\ No newline at end of file
+    if (grepl("rds$", file)) {
+        mod <- readRDS(file)
+        mod
+    } else {
+        mod <- new(ENSEMBLE)
+        mod$load_model(file)
+        mod
+    }
+}
modified R-package/R/gbt.save.R
@@ -47,7 +47,10 @@ gbt.save <- function(gbt_model, file)
     if (length(error_messages) > 0)
         stop(error_messages)
-    # Store model in file
-    gbt_model$save_model(file)
+    ## Store model in file
+    if (grepl("rds$", file))
+        saveRDS(gbt_model, file)
+    else
+        gbt_model$save_model(file)
 }
And based on a casual check of loading the (same) file fifty times, memory use seems stable, as it should be. Let me know if this helps; I can probably expand on it.
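With the two patches above applied, dispatch is driven entirely by the file name; usage would look like this (file names illustrative):

gbt.save(mod, "model.rds")     # ends in rds: takes the saveRDS() branch
m2 <- gbt.load("model.rds")    # takes the readRDS() branch
gbt.save(mod, "model.txt")     # anything else: the original text serialization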
@eddelbuettel thanks -- time off is time off, which needs to be spent wisely, I am realizing more and more. Here we just had our few good days of weather for the summer, which warranted time outside. I will try this as soon as possible. It would be amazing if this indeed works! I will still try to change the model class as intended, by introducing an enum loss-function attribute and some general refactoring, but perhaps that can be done after introducing your solution, as it might very well resolve the issue. 👏 As a side note, I remember "studying up" on serialization, in particular for C++ objects from R, and thought that I had unsuccessfully tried the saveRDS/readRDS route at that time. Thus I am very glad if this indeed works, and I appreciate the help immensely. 😄
It's easy to get lost in the weeds there. I just accessed RDS input/output from R. But as you mention it, I once needed it for Redis, so I extracted the (otherwise "hidden" and unexported) R serialization code into its own repo (also on CRAN), but we should not need that here. You know your model data so much better than I do, but having tried a few tricks over the years with serialization: if RDS works for you, I'd stick with that for now.
@eddelbuettel I created an early PR based on your suggestion. I get an error when running the suggested code, which seems similar to my earlier attempts in the past, I think. I googled a little bit and stumbled (again) upon the ever-helpful Rcpp Gallery, with this post on serialization of C++ objects. Perhaps I should do that first and then call saveRDS/readRDS? I would be surprised if that did not work, but it would require further dependencies... While thinking about the serialization, I will continue to fix the string issues in #58, as I believe this might also resolve the memory issues.
Howdy. Happy to get more involved and send you what I had as a PR (or push it to a fork, or ...), but can you debug locally in the first instance? What I offered (switching based on the final characters of the file name) may be "too loose". It may be simpler to keep your ASCII load and save as before, and to explicitly call readRDS() and saveRDS() where wanted. The Rcpp Gallery piece uses a different serialization (from Boost); we may not want that. I do have a helper package, RApiSerialize (on CRAN for many years; I happen to have just sent a minor update to CRAN), but that is only needed if you want to serialize from C++, which we may not need here (but I guess we could). The error message above seems to suggest that some wires are crossed on the Rcpp Modules side.
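If matching on the trailing characters is indeed too loose, a slightly stricter dispatch could compare the actual extension; a sketch using base R's tools::file_ext():

if (tolower(tools::file_ext(file)) == "rds") {
    mod <- readRDS(file)             # only a genuine ".rds" suffix qualifies
} else {
    mod <- new(ENSEMBLE)
    mod$load_model(file)
}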
I think I have detected a memory leak in gbt.load(). It could be that I am failing to understand how R handles memory (and the imperfections of that process), but maybe adding a manual garbage collection in the C++ routine could be the way to solve this?

Minimal replication: memory usage (by the R process, not the objects in the R workspace) increases on my system by roughly 1 MB every time the line below is run:
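The line itself is not shown here; given the description, it was presumably a single load of a previously saved model, e.g. (file name hypothetical):

mod <- gbt.load("gbt_model.txt")   # each call grows the R process by ~1 MB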