[Bug]: Service re-run terminates despite available fleet capacity #3403

@Bihan

Description

Steps to reproduce

Configs:

# my_cpu_fleet.yml
type: fleet
name: cpu-default
nodes: 0..8
resources:
  cpu: 2

# simple-service-replicas.yml
type: service
name: simple-service-replicas
https: false
python: 3.12
commands:
  - echo "Group default - Version 1" > /tmp/version.txt
  - python3 -m http.server 8000
port: 8000
resources:
  cpu: 2
replicas: 5

Step 1: Create the fleet: dstack apply -f my_cpu_fleet.yml

Step 2: Apply the service config: dstack apply -f simple-service-replicas.yml

The first run works as expected:

dstack ps
NAME                     BACKEND          GPU  PRICE    STATUS   SUBMITTED
simple-service-replicas  -                -             running  5 mins ago
  replica=0              aws (us-east-2)  -    $0.0832  running  5 mins ago
  replica=1              aws (us-east-2)  -    $0.0832  running  5 mins ago
  replica=2              aws (us-east-2)  -    $0.0832  running  5 mins ago
  replica=3              aws (us-east-2)  -    $0.0832  running  5 mins ago
  replica=4              aws (us-east-2)  -    $0.0832  running  5 mins ago

dstack fleet
FLEET        INSTANCE  BACKEND          RESOURCES                 PRICE    STATUS  CREATED
default      -         -                -                         -        -       4 days ago
cpu-default  0         aws (us-east-2)  cpu=2 mem=8GB disk=100GB  $0.0832  busy    4 mins ago
             1         aws (us-east-2)  cpu=2 mem=8GB disk=100GB  $0.0832  busy    4 mins ago
             2         aws (us-east-2)  cpu=2 mem=8GB disk=100GB  $0.0832  busy    4 mins ago
             3         aws (us-east-2)  cpu=2 mem=8GB disk=100GB  $0.0832  busy    4 mins ago
             4         aws (us-east-2)  cpu=2 mem=8GB disk=100GB  $0.0832  busy    4 mins ago

Step 3: Stop the run. The fleet instances become idle, as expected:

FLEET        INSTANCE  BACKEND          RESOURCES                 PRICE    STATUS  CREATED
default      -         -                -                         -        -       4 days ago
cpu-default  0         aws (us-east-2)  cpu=2 mem=8GB disk=100GB  $0.0832  idle    6 mins ago
             1         aws (us-east-2)  cpu=2 mem=8GB disk=100GB  $0.0832  idle    5 mins ago
             2         aws (us-east-2)  cpu=2 mem=8GB disk=100GB  $0.0832  idle    5 mins ago
             3         aws (us-east-2)  cpu=2 mem=8GB disk=100GB  $0.0832  idle    5 mins ago
             4         aws (us-east-2)  cpu=2 mem=8GB disk=100GB  $0.0832  idle    5 mins ago

Step 4: Apply the service config again: dstack apply -f simple-service-replicas.yml

dstack ps
NAME                     BACKEND          GPU  PRICE    STATUS   SUBMITTED
simple-service-replicas  -                -             running  56 sec ago
  replica=0              aws (us-east-2)  -    $0.0832  running  55 sec ago
  replica=1              aws (us-east-2)  -    $0.0832  pulling  55 sec ago
  replica=2              aws (us-east-2)  -    $0.0832  pulling  55 sec ago
  replica=3              aws (us-east-2)  -    $0.0832  pulling  55 sec ago
  replica=4              aws (us-east-2)  -    $0.0832  running  55 sec ago

All five fleet instances are expected to be busy while the replicas are pulling/running, but some are idle, as shown below:

dstack fleet
FLEET        INSTANCE  BACKEND          RESOURCES                 PRICE    STATUS  CREATED
default      -         -                -                         -        -       4 days ago
cpu-default  0         aws (us-east-2)  cpu=2 mem=8GB disk=100GB  $0.0832  busy    10 mins ago
             1         aws (us-east-2)  cpu=2 mem=8GB disk=100GB  $0.0832  idle    10 mins ago
             2         aws (us-east-2)  cpu=2 mem=8GB disk=100GB  $0.0832  idle    10 mins ago
             3         aws (us-east-2)  cpu=2 mem=8GB disk=100GB  $0.0832  idle    10 mins ago
             4         aws (us-east-2)  cpu=2 mem=8GB disk=100GB  $0.0832  busy    10 mins ago
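
For triage, the busy/idle mismatch in the fleet table above can be checked mechanically. This is only a minimal sketch, assuming the `dstack fleet` output has been pasted as text with one instance per line; `count_instance_statuses` and `SAMPLE` are hypothetical names for illustration and are not part of dstack.

```python
import re

# Instance rows copied from the Step 4 fleet table, one row per line.
SAMPLE = """\
cpu-default 0 aws (us-east-2) cpu=2 mem=8GB disk=100GB $0.0832 busy 10 mins ago
1 aws (us-east-2) cpu=2 mem=8GB disk=100GB $0.0832 idle 10 mins ago
2 aws (us-east-2) cpu=2 mem=8GB disk=100GB $0.0832 idle 10 mins ago
3 aws (us-east-2) cpu=2 mem=8GB disk=100GB $0.0832 idle 10 mins ago
4 aws (us-east-2) cpu=2 mem=8GB disk=100GB $0.0832 busy 10 mins ago
"""


def count_instance_statuses(fleet_output: str) -> dict[str, int]:
    """Count busy and idle instance rows in pasted `dstack fleet` text."""
    counts = {"busy": 0, "idle": 0}
    for line in fleet_output.splitlines():
        # Assumption: only instance rows carry a resources cell like "cpu=2".
        if "cpu=" not in line:
            continue
        match = re.search(r"\b(busy|idle)\b", line)
        if match:
            counts[match.group(1)] += 1
    return counts


if __name__ == "__main__":
    replicas = 5  # from simple-service-replicas.yml
    counts = count_instance_statuses(SAMPLE)
    print(counts)
    if counts["busy"] < replicas and counts["idle"] > 0:
        print("Mismatch: replicas were submitted but some instances are still idle")
```

On the sample rows this reports 2 busy and 3 idle instances, which matches the three replicas stuck in pulling in Step 4.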

Step 5: Check the run after a while:

dstack ps
NAME                     BACKEND          GPU  PRICE    STATUS       SUBMITTED
simple-service-replicas  -                -             terminating  3 mins ago
  replica=0              aws (us-east-2)  -    $0.0832  running      3 mins ago
  replica=1              aws (us-east-2)  -    $0.0832  terminating  3 mins ago
  replica=2              aws (us-east-2)  -    $0.0832  terminating  3 mins ago
  replica=3              aws (us-east-2)  -    $0.0832  terminating  3 mins ago
  replica=4              aws (us-east-2)  -    $0.0832  running      3 mins ago

The run gets terminated.

Actual behaviour

The run gets terminated on re-run even though the fleet has idle instances.

dstack fleet
FLEET        INSTANCE  BACKEND          RESOURCES                 PRICE    STATUS  CREATED
default      -         -                -                         -        -       4 days ago
cpu-default  0         aws (us-east-2)  cpu=2 mem=8GB disk=100GB  $0.0832  busy    10 mins ago
             1         aws (us-east-2)  cpu=2 mem=8GB disk=100GB  $0.0832  idle    10 mins ago
             2         aws (us-east-2)  cpu=2 mem=8GB disk=100GB  $0.0832  idle    10 mins ago
             3         aws (us-east-2)  cpu=2 mem=8GB disk=100GB  $0.0832  idle    10 mins ago
             4         aws (us-east-2)  cpu=2 mem=8GB disk=100GB  $0.0832  busy    10 mins ago

dstack ps
NAME                     BACKEND          GPU  PRICE    STATUS       SUBMITTED
simple-service-replicas  -                -             terminating  3 mins ago
  replica=0              aws (us-east-2)  -    $0.0832  running      3 mins ago
  replica=1              aws (us-east-2)  -    $0.0832  terminating  3 mins ago
  replica=2              aws (us-east-2)  -    $0.0832  terminating  3 mins ago
  replica=3              aws (us-east-2)  -    $0.0832  terminating  3 mins ago
  replica=4              aws (us-east-2)  -    $0.0832  running      3 mins ago

Expected behaviour

The re-run should not be terminated, and the idle fleet instances should be utilized.
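
To make the expectation concrete, here is a rough sketch of the placement order I would expect: reuse idle instances first, provision new ones only up to the fleet's `nodes` maximum, and leave a replica unplaced only when both are exhausted. This is an illustration of the expected outcome, not dstack's actual scheduler; `plan_placement` and its parameters are hypothetical.

```python
def plan_placement(
    replicas: list[int],
    idle_instances: list[int],
    current_nodes: int,
    max_nodes: int,
) -> tuple[dict[int, int], list[int], list[int]]:
    """Return (assignments, to_provision, unplaced) for the given replicas.

    assignments maps replica number -> idle instance number.
    """
    # Reuse idle instances first, pairing replicas and instances in order.
    assignments = dict(zip(replicas, idle_instances))
    remaining = replicas[len(assignments):]

    # Provision new instances only while the fleet is below its maximum size.
    free_slots = max(0, max_nodes - current_nodes)
    to_provision = remaining[:free_slots]

    # Anything left over cannot be placed in this fleet.
    unplaced = remaining[free_slots:]
    return assignments, to_provision, unplaced


if __name__ == "__main__":
    # Values from this report: 5 replicas, 5 idle instances, nodes: 0..8.
    print(plan_placement([0, 1, 2, 3, 4], [0, 1, 2, 3, 4], current_nodes=5, max_nodes=8))
```

With the values from this report, every replica maps to an idle instance and nothing needs to be provisioned or left unplaced, so there should be no reason for the run to terminate.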

dstack version

master (commit: b2be6a7)

Server logs

Additional information

server_logs_fleet_issue.txt


Labels: bug, major
