Add replica groups in dstack-service #3408
Conversation
Bihan commented Dec 20, 2025 • edited
Bihan commented Dec 20, 2025
Will be resolving merge conflicts as the review continues.
Bihan commented Dec 20, 2025
Related PR: #3205 from @DragonStuff
peterschmidt85 commented Dec 20, 2025 • edited
@Bihan Do we really need replica group names?
peterschmidt85 commented Dec 20, 2025
@Bihan Also please check the conflicts with
peterschmidt85 commented Dec 20, 2025
Cosmetics only: I would rename
Bihan commented Dec 22, 2025
Yes, will rename it.
Bihan commented Dec 22, 2025
Yes. Without replica names, we would rely on indices, which are position-dependent. If users reorder groups during manual scaling, the indices shift, but existing jobs and persisted state (like desired_replica_counts) still reference the old positions. This mismatch prevents reliably identifying which group a job belongs to, leading to incorrect scaling decisions. Replica names are not affected by reordering in the YAML file.

Initial Manual Scaling

Instead of relying on a replica group's position in the config, another possibility is matching job specs to identify replicas; but this approach fails during rolling deployments because old and new jobs from the same group have different specs.
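To make the indexing problem concrete, here is a minimal, self-contained sketch (purely illustrative, not dstack code; the group names and counts are made up):

```python
# Illustrative only: persisted state keyed by group position breaks after a
# reorder in the YAML, while state keyed by group name does not.
groups_before = [{"name": "r1", "count": 1}, {"name": "r2", "count": 3}]
desired_by_index = {i: g["count"] for i, g in enumerate(groups_before)}
desired_by_name = {g["name"]: g["count"] for g in groups_before}

# The user reorders the groups in the YAML file.
groups_after = [{"name": "r2", "count": 3}, {"name": "r1", "count": 1}]

# Position-based lookup now attributes r1's desired count to r2.
assert desired_by_index[0] != groups_after[0]["count"]
# Name-based lookup is unaffected by the reordering.
assert all(desired_by_name[g["name"]] == g["count"] for g in groups_after)
```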
peterschmidt85 commented Dec 22, 2025
As a user, I find it unnecessary to give names. I would prefer not to ask for names if this is technically possible.
Bihan commented Dec 22, 2025
If a user changes commands for group 0 and reorders groups at the same time, they expect a rolling deployment for group 0 only. However, the system detects the order change and triggers a full redeployment of all groups. Users may find this implicit behavior annoying because it provisions extra instances for each group.
peterschmidt85 commented Dec 22, 2025
Perhaps we could make these names optional?
Bihan commented Dec 22, 2025
Yes, we can make them optional.
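For illustration, one way names could be made optional is to have a validator assign stable defaults when they are omitted. The sketch below uses plain Pydantic with made-up field names, not dstack's actual configuration models:

```python
# Sketch only: plain Pydantic (v1-style validator); dstack's real models differ.
from typing import List, Optional

from pydantic import BaseModel, validator


class ReplicaGroup(BaseModel):
    name: Optional[str] = None
    count: int = 1
    commands: List[str] = []


class ServiceConfig(BaseModel):
    replicas: List[ReplicaGroup]

    @validator("replicas")
    def fill_default_group_names(cls, groups):
        # Auto-name unnamed groups so downstream code can always key by name.
        # Position-derived defaults are only stable while unnamed groups are not
        # reordered, which is why explicit names remain useful for manual scaling.
        for i, group in enumerate(groups):
            if group.name is None:
                group.name = f"group-{i}"
        return groups
```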
Commits:
- add_replica_groups_model
- Replica Groups AutoScaling
- Rolling deployment and UI
- Replica Groups implementation clean up

Force-pushed from 86139c5 to 5abbcad.
class ReplicaGroup(ConfigurationWithCommandsParams, CoreModel):
(nit) I think we could allow setting many more properties per replica group. If the user can set commands, they may also want to set entrypoint, working_dir, image, volumes, repos, etc. And if the user can set resources, they may also want to set instance_types, spot_policy, reservation, etc.
Although it may be a good idea to leave this to a future iteration, because some properties may be non-trivial to support correctly.
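For reference, a hedged sketch of what such an extended per-group model might look like. This is plain Pydantic with illustrative field names, not the actual dstack configuration classes:

```python
# Illustrative only; real dstack configuration classes and types differ.
from typing import Any, Dict, List, Optional

from pydantic import BaseModel


class ExtendedReplicaGroup(BaseModel):
    name: Optional[str] = None
    count: int = 1
    # Execution-related overrides
    commands: List[str] = []
    entrypoint: Optional[str] = None
    working_dir: Optional[str] = None
    image: Optional[str] = None
    # Scheduling-related overrides
    resources: Optional[Dict[str, Any]] = None
    instance_types: Optional[List[str]] = None
    spot_policy: Optional[str] = None
    reservation: Optional[str] = None
```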
new_job_specs = await get_job_specs_from_run_spec(
-   run_spec=run_spec,
+   run_spec=group_run_spec,
    secrets=secrets,
    replica_num=replica_num,
)
Job specs from new_job_specs all have replica_group set to "default", which doesn't match the job specs of existing jobs. This prevents in-place updates.
To reproduce
Apply this conf:

name: test
port: 8000
priority: 1
replicas:
  - name: r1
    count: 1
    commands:
      - sleep infinity

Then apply this conf:

name: test
port: 8000
priority: 2
replicas:
  - name: r1
    count: 1
    commands:
      - sleep infinity

Expected
Changing priority does not cause replica redeployment — the job's deployment_num is updated in-place.
Actual
The job is redeployed.
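For context, here is a toy, self-contained illustration (not dstack's actual update logic) of the expected behavior: a change to an in-place updatable field such as priority should not trigger a redeployment, whereas a mismatched replica_group does:

```python
# Toy illustration only; dstack's real comparison logic differs.
IN_PLACE_UPDATABLE_FIELDS = {"priority"}

def requires_redeployment(old_spec: dict, new_spec: dict) -> bool:
    # Redeploy only if a field that cannot be updated in place has changed.
    changed = {k for k in old_spec if old_spec[k] != new_spec.get(k)}
    return bool(changed - IN_PLACE_UPDATABLE_FIELDS)

old = {"priority": 1, "commands": ["sleep infinity"], "replica_group": "r1"}
new = {"priority": 2, "commands": ["sleep infinity"], "replica_group": "default"}

# With replica_group mismatching ("default" vs "r1"), a redeploy is triggered
# even though only priority was meant to change, which is the reported bug.
assert requires_redeployment(old, new) is True

new_fixed = dict(new, replica_group="r1")
assert requires_redeployment(old, new_fixed) is False  # in-place update
```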
Resolved.
    run_spec=spec,
    secrets=secrets,
    replica_num=0,
)
Currently, all job specs returned by /api/project/{project_name}/runs/get_plan always have replica_group="default", even if the configuration specifies other replica groups. This doesn't affect UX right now, but it will be a problem once we try to implement per-group offers display in the run plan as you suggested earlier
Resolved
jobs = await get_jobs_from_run_spec(
    run_spec=group_run_spec,
    secrets=secrets,
    replica_num=global_replica_num,
)
for job in jobs:
    assert replica_group.name is not None, (
        "ReplicaGroup name should be set by validator"
    )
    job.job_spec.replica_group = replica_group.name
I've noticed this pattern in several places in the PR — you create the job using get_jobs_from_run_spec, then mutate its job_spec.replica_group.
Mutating job_spec.replica_group is not an obvious responsibility, so it is very easy to forget doing so. For example, the issues from my two previous comments (1, 2) appear to be caused by not setting replica_group properly.
Maybe get_jobs_from_run_spec could accept replica_group_name as an argument? The function would then pass it down the call stack, possibly until it reaches JobConfigurator._get_job_spec, where it would be assigned to the job spec when constructing it. This would remove the implicit responsibility of setting job_spec.replica_group after constructing the job
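A toy, self-contained sketch of the suggested direction, with deliberately simplified names and signatures (the real get_jobs_from_run_spec is async and takes a run spec, secrets, and more):

```python
# Toy sketch only; dstack's real function signatures and types differ.
from dataclasses import dataclass
from typing import List


@dataclass
class JobSpec:
    replica_num: int
    replica_group: str


def get_jobs_from_run_spec(replica_num: int, replica_group_name: str = "default") -> List[JobSpec]:
    # The group name flows down to where the job spec is constructed, so callers
    # no longer need to remember to patch job_spec.replica_group afterwards.
    return [JobSpec(replica_num=replica_num, replica_group=replica_group_name)]


jobs = get_jobs_from_run_spec(replica_num=0, replica_group_name="r1")
assert all(job.replica_group == "r1" for job in jobs)
```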
This is the right way. Thanks for the idea. Will do it accordingly.
Implemented.
group_run_spec = create_group_run_spec(
    base_run_spec=run_spec,
    replica_group=replica_group,
)
jobs = await get_jobs_from_run_spec(
    run_spec=group_run_spec,
    secrets=secrets,
    replica_num=global_replica_num,
)
This pattern can be seen in many places in the PR — you are converting the run spec to a group run spec, then passing it to get_jobs_from_run_spec. Similarly to my other comment, this pattern introduces an implicit responsibility for get_jobs_from_run_spec callers, which can be error-prone.
Also, the fact that RunSpec can now be used for both runs and replica groups makes it a bit more difficult to understand the code — now, when you see a RunSpec object, you can't always be sure if it's actually a run spec or a replica group spec.
Maybe we could avoid this conversion from run spec to group run spec throughout the codebase? For example, if get_jobs_from_run_spec has the replica_group_name parameter and passes it on to the job configurator as I suggested earlier, maybe it could be the job configurator's responsibility to consider both the run spec and the replica group params when building job specs?
To remove the error-prone conversion from run spec to group run spec throughout the codebase, I plan to update JobConfigurator and ServiceJobConfigurator as below.
class JobConfigurator(ABC):
    TYPE: RunConfigurationType

    _image_config: Optional[ImageConfig] = None
    # JobSSHKey should be shared for all jobs in a replica for inter-node communication.
    _job_ssh_key: Optional[JobSSHKey] = None

    def __init__(
        self,
        run_spec: RunSpec,
        secrets: Optional[Dict[str, str]] = None,
        replica_group_name: Optional[str] = None,  # <- Pass replica group name to configurator
    ):
        self.run_spec = run_spec
        self.secrets = secrets or {}
        self.replica_group_name = replica_group_name  # <- Store for use in _shell_commands() and _requirements()

    ...

    async def _get_job_spec(
        self,
        replica_num: int,
        job_num: int,
        jobs_per_replica: int,
    ) -> JobSpec:
        job_spec = JobSpec(
            replica_num=replica_num,
            job_num=job_num,
            job_name=f"{self.run_spec.run_name}-{job_num}-{replica_num}",
            jobs_per_replica=jobs_per_replica,
            replica_group=self.replica_group_name or "default",  # <- Set replica_group
            # ... other fields ...
        )
        return job_spec

    def _requirements(self, jobs_per_replica: int) -> Requirements:
        resources = self.run_spec.configuration.resources
        # Look up replica group and use its resources if type is service.
        if self.run_spec.configuration.type == "service":
            for group in self.run_spec.configuration.replica_groups:
                if group.name == self.replica_group_name:
                    resources = group.resources  # Use group-specific resources
                    break
        spot_policy = self._spot_policy()
        return Requirements(
            resources=resources,
            max_price=self.run_spec.merged_profile.max_price,
            spot=None if spot_policy == SpotPolicy.AUTO else (spot_policy == SpotPolicy.SPOT),
            reservation=self.run_spec.merged_profile.reservation,
            multinode=jobs_per_replica > 1,
        )


class ServiceJobConfigurator(JobConfigurator):
    TYPE: RunConfigurationType = RunConfigurationType.SERVICE

    def _shell_commands(self) -> List[str]:
        assert self.run_spec.configuration.type == "service"
        # Look up replica group and use its commands.
        for group in self.run_spec.configuration.replica_groups:
            if group.name == self.replica_group_name:
                return group.commands  # Use group-specific commands
        # This shouldn't happen - raise error
        raise ValueError(
            f"Replica group '{self.replica_group_name}' not found in service configuration."
        )

peterschmidt85 commented Jan 1, 2026
@Bihan the tests seem to be broken. Have you seen it?
Steps To Test

Step 1: Create replica-groups-service.yml.

Step 2: Apply it:
dstack apply -f replica-groups-service.yml

Step 3: Run load_test_replica_groups.py, substituting your URL and TOKEN.
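For orientation, here is a minimal, hypothetical sketch of what such a load-testing script might look like. The real load_test_replica_groups.py may differ; URL, TOKEN, and RPS are placeholders to substitute:

```python
# Hypothetical sketch only; the actual load_test_replica_groups.py may differ.
# Generates steady traffic against the service so autoscaling has load to react to.
import time

import requests

URL = "https://<your-service-url>"  # substitute your service URL
TOKEN = "<your-dstack-token>"       # substitute your token
RPS = 5                             # target requests per second

headers = {"Authorization": f"Bearer {TOKEN}"}
while True:
    start = time.monotonic()
    for _ in range(RPS):
        try:
            requests.get(URL, headers=headers, timeout=10)
        except requests.RequestException as exc:
            print(f"request failed: {exc}")
    # Sleep out the remainder of the second to hold the target rate.
    time.sleep(max(0.0, 1.0 - (time.monotonic() - start)))
```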
Expected Output

Each group initially gets one replica. Later, both groups scale according to their group configs: group0 scales to 2 replicas and group1 scales to 3. Below is the expected output.
Step 4: Check whether replica-specific commands were executed. Attach to the desired replica, e.g.:
dstack attach -replica 2 replica-groups-test
ssh replica-groups-test-0-2 'cat /tmp/version.txt'
Output: Group 1 - Version 0

Step 5: Check rolling deployment.
Important:
Rolling deployments are currently affected by a race condition that also impacts the non–replica group implementation and must be addressed separately (issue). However, when each replica group is configured with a single replica, this race condition does not affect rolling deployments.
Testing instructions:
Scale down each replica group to 1 replica.
Restart the load-testing script with RPS = 2.
After all groups have scaled down to a single replica, re-apply the configuration:
dstack apply -f replica-groups-service.yml