Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Any way to inform ETA estimate with prior info? #142

Open
mike-lawrence opened this issue Aug 25, 2022 · 3 comments
Open

Any way to inform ETA estimate with prior info? #142

mike-lawrence opened this issue Aug 25, 2022 · 3 comments

Comments

@mike-lawrence
Copy link

mike-lawrence commented Aug 25, 2022

I'm working on a pipeline-prototyping package that uses progressr to track progress as many elements of a list are processed and where efficient re-processing is enabled by skipping over already processed elements. This results in a scenario where I gather the current ETA calculations of progressr become inaccurate as early elements that have already been successfully processed return from the processing function quickly while later elements can be expected to take much longer. I save history data associated with each element, including processing time duration, and I'm wondering if it'd be possible to provide that duration to progressr somehow to leave it with more accurate ETA estimates. A minimal example of my case is:

options(
	progressr.handlers = progressr::handler_progress(
		format   = "[:bar] :spin :current/:total :percent in :elapsed ETA: :eta (:message)"
		, clear = FALSE
	)
)
options(progressr.clear=FALSE)
f = function(.x,pb){
    x_hash = digest::digest(.x)
    if(file.exists(x_hash)){
        t = readRDS(x_hash)
    }else{
        t = runif(1,10,20)
        Sys.sleep(t)
        if(.x%%2){  # odd elements get saved, triggering skip next time they're processed
            saveRDS(t,x_hash)
        }
    }
    pb(duration = t) #imagined API for supplying the duration manually
    
}

#on first run, ETA is accurate
progressr::with_progress({
    .x = 1:10
    pb <- progressr::progressor(along = .x)
    y = furrr::future_map(.x=.x,.f=f,pb=pb,.progress=F)
})

#on second run, some elements are skipped internally by f()
progressr::with_progress({
    .x = 1:10
    pb <- progressr::progressor(along = .x)
    y = furrr::future_map(.x=.x,.f=f,pb=pb,.progress=F)
})
@HenrikBengtsson
Copy link
Collaborator

HenrikBengtsson commented Aug 25, 2022

The progressr package is not aware of time. The task for estimating ETA is passed on to the underlying progress handler, e.g. the progress package (as here). This was a deliberate choice to keep the progressr API as simple as possible, especially in its early stage of life.

One idea I had in the past related to this was to signal that some progress steps are skipped whereas others are not (the default), e.g.

pb(skip = TRUE)

or possibly

pb(skip = 0.9)

where skip is a fraction in [0,1] where skip=1.0 is full skip ("zero cost"), and skip=0.9 (10% cost), etc. (Maybe, cost is a better name, where cost = 1.0 is the default)

Then the progress handler can do whatever they'd like with that info, e.g. adjust their ETA estimates, or completely ignore it. See also r-lib/progress#120 for my proposal to the progress package. This is an idea that needs to maturity and be explored, e.g. is skip the most generic approach (where ETA is just a special case), is it sufficient, or do we need more.

@tripartio
Copy link

tripartio commented Jan 30, 2024

progressr::handlers('cli') supports very accurate ETA estimates by default with no further code needed. I am especially surprised about how good the estimates are even with parallel processing with the {future} framework.

@HenrikBengtsson
Copy link
Collaborator

progressr::handlers('cli') supports very accurate ETA estimates by default with no further code needed. I am especially surprised about how good the estimates are even with parallel processing with the {future} framework.

Thanks for sharing your experience and observations.

FWIW, it makes sense that the ETA works equally well in parallel. cli cannot even tell if we're running sequentially or parallelly; all it sees is an incoming stream of progress updates. The only difference, when running in parallel, is that they'll arrive more frequently. So, cli just concludes that things are running faster and adjusts its ETA accordingly.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants