[SERVE][AUTOSCALERS] Replica scaling sampling period and stability. #4444

JGSweets · 2024-12-05T17:30:42Z

In autoscalers.py within serve:

Lines 258 to 269 in 3f62588

    elif target_num_replicas > self.target_num_replicas:
        self.upscale_counter += 1
        self.downscale_counter = 0
        if self.upscale_counter >= self.scale_up_consecutive_periods:
            self.upscale_counter = 0
            self.target_num_replicas = target_num_replicas
    elif target_num_replicas < self.target_num_replicas:
        self.downscale_counter += 1
        self.upscale_counter = 0
        if self.downscale_counter >= self.scale_down_consecutive_periods:
            self.downscale_counter = 0
            self.target_num_replicas = target_num_replicas

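When a single QPS check lands on the other side of the threshold, the counter for the opposite direction is reset to 0: one upscale sample zeroes downscale_counter, and one downscale sample zeroes upscale_counter.
This means a single jitter in QPS can restart the consecutive-period count and delay or block scaling.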
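
To illustrate the effect, here is a minimal sketch that replays only the two branches quoted above against a list of per-period targets. The class name `ConsecutiveCounter`, its constructor arguments, and the `observe` method are hypothetical and are not part of autoscalers.py; the case where the sample equals the current target is not shown in the excerpt, so the sketch leaves it as a no-op.

```python
class ConsecutiveCounter:
    """Hypothetical stand-in that mirrors the two quoted branches only."""

    def __init__(self, current, up_periods, down_periods):
        self.target_num_replicas = current
        self.scale_up_consecutive_periods = up_periods
        self.scale_down_consecutive_periods = down_periods
        self.upscale_counter = 0
        self.downscale_counter = 0

    def observe(self, target_num_replicas):
        # Equal samples are not covered by the quoted excerpt; this sketch
        # assumes they leave both counters untouched.
        if target_num_replicas > self.target_num_replicas:
            self.upscale_counter += 1
            self.downscale_counter = 0
            if self.upscale_counter >= self.scale_up_consecutive_periods:
                self.upscale_counter = 0
                self.target_num_replicas = target_num_replicas
        elif target_num_replicas < self.target_num_replicas:
            self.downscale_counter += 1
            self.upscale_counter = 0
            if self.downscale_counter >= self.scale_down_consecutive_periods:
                self.downscale_counter = 0
                self.target_num_replicas = target_num_replicas
        return self.target_num_replicas


def replay(samples, current=2, up_periods=5, down_periods=10):
    """Feed every sample through the counter and return the final target."""
    scaler = ConsecutiveCounter(current, up_periods, down_periods)
    for sample in samples:
        scaler.observe(sample)
    return scaler.target_num_replicas


# Five clean upscale samples commit the change.
steady = [3, 3, 3, 3, 3]
# Eight of ten samples ask for 3 replicas, but every fifth one dips to 1
# and resets upscale_counter before it reaches 5.
jittery = [3, 3, 3, 3, 1, 3, 3, 3, 3, 1]
```

With these illustrative numbers, `replay(steady)` moves to 3 replicas while `replay(jittery)` stays at 2, even though 80% of the jittery samples call for an upscale.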

I propose sampling over a window of periods, so that scaling occurs when a given percentage of the samples in that window call for it, instead of resetting the counter to 0 on a single contrary sample.
The window and percentage could be set in the scaling policy.
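
One possible shape for this, as a rough sketch rather than a concrete design: keep the last N per-period targets in a sliding window and scale when at least a given fraction of them point the same way. The names `WindowedScaler`, `window_size`, and `min_fraction` are hypothetical and do not exist in SkyPilot today.

```python
from collections import deque


class WindowedScaler:
    """Hypothetical sketch: scale on a fraction of agreeing samples."""

    def __init__(self, current, window_size, min_fraction):
        # min_fraction should be above 0.5 so that the upscale and
        # downscale conditions cannot both hold for the same window.
        self.target_num_replicas = current
        self.window = deque(maxlen=window_size)
        self.min_fraction = min_fraction

    def observe(self, target_num_replicas):
        self.window.append(target_num_replicas)
        if len(self.window) < self.window.maxlen:
            # Not enough history yet to make a decision.
            return self.target_num_replicas

        ups = [t for t in self.window if t > self.target_num_replicas]
        downs = [t for t in self.window if t < self.target_num_replicas]
        size = len(self.window)

        if len(ups) / size >= self.min_fraction:
            # Conservative choice: take the smallest requested increase.
            self.target_num_replicas = min(ups)
            self.window.clear()
        elif len(downs) / size >= self.min_fraction:
            # Conservative choice: take the smallest requested decrease.
            self.target_num_replicas = max(downs)
            self.window.clear()
        return self.target_num_replicas


def replay_windowed(samples, current=2, window_size=4, min_fraction=0.75):
    """Feed every sample through the scaler and return the final target."""
    scaler = WindowedScaler(current, window_size, min_fraction)
    for sample in samples:
        scaler.observe(sample)
    return scaler.target_num_replicas
```

In this sketch, `replay_windowed([3, 3, 1, 3])` scales to 3 because three of four samples agree, while `replay_windowed([3, 1, 3, 1])` stays at 2. Clearing the window after a change is one design choice among several; it prevents stale samples taken under the old replica count from triggering a second change immediately.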

Also, since the replica count is computed with math.ceil, scaling errs on the side of adding replicas: the configured QPS value acts as a maximum per replica rather than a target.

skypilot/sky/serve/autoscalers.py, line 192 in 3f62588:

    target_num_replicas = math.ceil(
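
The arithmetic difference can be seen in a small sketch. The function names and the `total_qps` / `target_qps_per_replica` parameters below are assumptions for illustration; the actual argument to `math.ceil` is truncated in the excerpt above and is not reproduced here. Rounding to nearest is shown only as one possible "target" interpretation, not as the fix.

```python
import math


def replicas_ceil(total_qps, target_qps_per_replica):
    """Treat the per-replica QPS as a ceiling: never exceed it."""
    return math.ceil(total_qps / target_qps_per_replica)


def replicas_round(total_qps, target_qps_per_replica):
    """Treat the per-replica QPS as a target: land as close as possible.

    Python's round() rounds exact halves to the nearest even integer,
    and the result is clamped to at least one replica.
    """
    return max(1, round(total_qps / target_qps_per_replica))


def per_replica_load(total_qps, replicas):
    """Average QPS each replica would see."""
    return total_qps / replicas
```

For example, with a total of 10.5 QPS and a value of 10 QPS per replica, `replicas_ceil` returns 2 (5.25 QPS each, well under the configured value), while `replicas_round` returns 1 (10.5 QPS, slightly over it).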

Version & Commit info:

sky -v: 0.7.0
sky -c: 3f62588


Michaelvll · 2024-12-12T06:05:08Z

cc'ing @cblmemo
