Steps to reproduce
Configs:
    # my_cpu_fleet.yml
    type: fleet
    name: cpu-default
    nodes: 0..8
    resources:
      cpu: 2

    # simple-service-replicas.yml
    type: service
    name: simple-service-replicas
    https: false
    python: 3.12
    commands:
      - echo "Group default - Version 1" > /tmp/version.txt
      - python3 -m http.server 8000
    port: 8000
    resources:
      cpu: 2
    replicas: 5

Step 1: Create the fleet:

    dstack apply -f my_cpu_fleet.yml
Step 2: Apply the service config:

    dstack apply -f simple-service-replicas.yml
The first run works as expected
    dstack ps
    NAME BACKEND GPU PRICE STATUS SUBMITTED
    simple-service-replicas - - running 5 mins ago
    replica=0 aws (us-east-2) - $0.0832 running 5 mins ago
    replica=1 aws (us-east-2) - $0.0832 running 5 mins ago
    replica=2 aws (us-east-2) - $0.0832 running 5 mins ago
    replica=3 aws (us-east-2) - $0.0832 running 5 mins ago
    replica=4 aws (us-east-2) - $0.0832 running 5 mins ago

    dstack fleet
    FLEET        INSTANCE  BACKEND          RESOURCES                 PRICE    STATUS  CREATED
    default      -         -                -                         -        -       4 days ago
    cpu-default  0         aws (us-east-2)  cpu=2 mem=8GB disk=100GB  $0.0832  busy    4 mins ago
                 1         aws (us-east-2)  cpu=2 mem=8GB disk=100GB  $0.0832  busy    4 mins ago
                 2         aws (us-east-2)  cpu=2 mem=8GB disk=100GB  $0.0832  busy    4 mins ago
                 3         aws (us-east-2)  cpu=2 mem=8GB disk=100GB  $0.0832  busy    4 mins ago
                 4         aws (us-east-2)  cpu=2 mem=8GB disk=100GB  $0.0832  busy    4 mins ago

Step 3: Stop the run. The fleet instances become idle, as expected:
    FLEET        INSTANCE  BACKEND          RESOURCES                 PRICE    STATUS  CREATED
    default      -         -                -                         -        -       4 days ago
    cpu-default  0         aws (us-east-2)  cpu=2 mem=8GB disk=100GB  $0.0832  idle    6 mins ago
                 1         aws (us-east-2)  cpu=2 mem=8GB disk=100GB  $0.0832  idle    5 mins ago
                 2         aws (us-east-2)  cpu=2 mem=8GB disk=100GB  $0.0832  idle    5 mins ago
                 3         aws (us-east-2)  cpu=2 mem=8GB disk=100GB  $0.0832  idle    5 mins ago
                 4         aws (us-east-2)  cpu=2 mem=8GB disk=100GB  $0.0832  idle    5 mins ago

Step 4: Apply the service config again:

    dstack apply -f simple-service-replicas.yml
    dstack ps
    NAME BACKEND GPU PRICE STATUS SUBMITTED
    simple-service-replicas - - running 56 sec ago
    replica=0 aws (us-east-2) - $0.0832 running 55 sec ago
    replica=1 aws (us-east-2) - $0.0832 pulling 55 sec ago
    replica=2 aws (us-east-2) - $0.0832 pulling 55 sec ago
    replica=3 aws (us-east-2) - $0.0832 pulling 55 sec ago
    replica=4 aws (us-east-2) - $0.0832 running 55 sec ago

All five fleet instances are expected to be busy while the replicas are pulling or running, but some are idle, as shown below:
    dstack fleet
    FLEET        INSTANCE  BACKEND          RESOURCES                 PRICE    STATUS  CREATED
    default      -         -                -                         -        -       4 days ago
    cpu-default  0         aws (us-east-2)  cpu=2 mem=8GB disk=100GB  $0.0832  busy    10 mins ago
                 1         aws (us-east-2)  cpu=2 mem=8GB disk=100GB  $0.0832  idle    10 mins ago
                 2         aws (us-east-2)  cpu=2 mem=8GB disk=100GB  $0.0832  idle    10 mins ago
                 3         aws (us-east-2)  cpu=2 mem=8GB disk=100GB  $0.0832  idle    10 mins ago
                 4         aws (us-east-2)  cpu=2 mem=8GB disk=100GB  $0.0832  busy    10 mins ago

Step 5: Check the run after a while:
    dstack ps
    NAME BACKEND GPU PRICE STATUS SUBMITTED
    simple-service-replicas - - terminating 3 mins ago
    replica=0 aws (us-east-2) - $0.0832 running 3 mins ago
    replica=1 aws (us-east-2) - $0.0832 terminating 3 mins ago
    replica=2 aws (us-east-2) - $0.0832 terminating 3 mins ago
    replica=3 aws (us-east-2) - $0.0832 terminating 3 mins ago
    replica=4 aws (us-east-2) - $0.0832 running 3 mins ago

The run gets terminated.
Actual behaviour
The run gets terminated on re-run even though the fleet has idle instances.
    dstack fleet
    FLEET        INSTANCE  BACKEND          RESOURCES                 PRICE    STATUS  CREATED
    default      -         -                -                         -        -       4 days ago
    cpu-default  0         aws (us-east-2)  cpu=2 mem=8GB disk=100GB  $0.0832  busy    10 mins ago
                 1         aws (us-east-2)  cpu=2 mem=8GB disk=100GB  $0.0832  idle    10 mins ago
                 2         aws (us-east-2)  cpu=2 mem=8GB disk=100GB  $0.0832  idle    10 mins ago
                 3         aws (us-east-2)  cpu=2 mem=8GB disk=100GB  $0.0832  idle    10 mins ago
                 4         aws (us-east-2)  cpu=2 mem=8GB disk=100GB  $0.0832  busy    10 mins ago

    dstack ps
    NAME BACKEND GPU PRICE STATUS SUBMITTED
    simple-service-replicas - - terminating 3 mins ago
    replica=0 aws (us-east-2) - $0.0832 running 3 mins ago
    replica=1 aws (us-east-2) - $0.0832 terminating 3 mins ago
    replica=2 aws (us-east-2) - $0.0832 terminating 3 mins ago
    replica=3 aws (us-east-2) - $0.0832 terminating 3 mins ago
    replica=4 aws (us-east-2) - $0.0832 running 3 mins ago

Expected behaviour
The re-run should not be terminated and idle fleet instances should be utilized.
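For anyone trying to confirm the mismatch quickly, here is a minimal sketch that counts statuses in pasted `dstack fleet` and `dstack ps` output. It only does text matching on the Step 4 output from this report and does not use any dstack API. The helper names (`count_statuses`, `check_assignment`) are hypothetical, and the check it applies (at least one busy instance per pulling/running replica) is simply the expectation described above, not a documented dstack guarantee.

```python
from collections import Counter

# Instance rows from the Step 4 `dstack fleet` output (whitespace collapsed).
FLEET_OUTPUT = """\
cpu-default 0 aws (us-east-2) cpu=2 mem=8GB disk=100GB $0.0832 busy 10 mins ago
1 aws (us-east-2) cpu=2 mem=8GB disk=100GB $0.0832 idle 10 mins ago
2 aws (us-east-2) cpu=2 mem=8GB disk=100GB $0.0832 idle 10 mins ago
3 aws (us-east-2) cpu=2 mem=8GB disk=100GB $0.0832 idle 10 mins ago
4 aws (us-east-2) cpu=2 mem=8GB disk=100GB $0.0832 busy 10 mins ago
"""

# Replica rows from the Step 4 `dstack ps` output.
PS_OUTPUT = """\
replica=0 aws (us-east-2) - $0.0832 running 55 sec ago
replica=1 aws (us-east-2) - $0.0832 pulling 55 sec ago
replica=2 aws (us-east-2) - $0.0832 pulling 55 sec ago
replica=3 aws (us-east-2) - $0.0832 pulling 55 sec ago
replica=4 aws (us-east-2) - $0.0832 running 55 sec ago
"""

# Status strings as they appear in the output above.
INSTANCE_STATUSES = {"busy", "idle"}
ACTIVE_REPLICA_STATUSES = {"pulling", "running"}


def count_statuses(text, known_statuses, line_prefix=""):
    """Count the first known status token on each line starting with line_prefix."""
    counts = Counter()
    for line in text.splitlines():
        if not line.startswith(line_prefix):
            continue
        for token in line.split():
            if token in known_statuses:
                counts[token] += 1
                break
    return counts


def check_assignment(fleet_text, ps_text):
    """Compare active replicas against busy instances (hypothetical check)."""
    instances = count_statuses(fleet_text, INSTANCE_STATUSES)
    replicas = count_statuses(ps_text, ACTIVE_REPLICA_STATUSES, line_prefix="replica=")
    active = sum(replicas.values())
    return {
        "active_replicas": active,
        "busy": instances["busy"],
        "idle": instances["idle"],
        # Expectation from this report: every active replica holds a busy instance.
        "consistent": instances["busy"] >= active,
    }


if __name__ == "__main__":
    print(check_assignment(FLEET_OUTPUT, PS_OUTPUT))
```

On the Step 4 output this reports five active replicas against two busy and three idle instances, which matches the three replicas stuck in `pulling`.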
dstack version
master (commit: b2be6a7)