(torchx/specs) Allow roles to specify their own workspaces #1139

Conversation
@kiukchung has exported this pull request. If you are a Meta employee, you can view the originating Diff in D83793199.
Codecov Report

✅ All modified and coverable lines are covered by tests.

```
@@            Coverage Diff             @@
##             main    #1139      +/-   ##
==========================================
+ Coverage   91.63%   91.64%   +0.01%
==========================================
  Files          83       83
  Lines        6431     6441      +10
==========================================
+ Hits         5893     5903      +10
  Misses        538      538
```
kiukchung force-pushed from b748dcd to bedb24a (Compare)
kiukchung force-pushed from bedb24a to 12526d4 (Compare)
kiukchung force-pushed from 12526d4 to 0c92e58 (Compare)
Summary:

So far you can only specify the workspace in the runner's API:

```
runner.dryrun(appdef, cfg, workspace=...)
```

in which case the workspace applies to ONLY `role[0]`. This behavior was intentional since multi-role use cases of TorchX typically had a single "main" role that the application owner actually owned, while the other roles were prepackaged apps (not part of your project).

This is no longer the case with applications such as reinforcement learning, where the project encompasses multiple applications (e.g. trainer, generator, etc.), so we need a more flexible way to specify a workspace per Role.
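For illustration, a per-role workspace might look like the following sketch. `AppDef` and `Role` come from `torchx.specs`; the `workspace` attribute is the new piece this PR adds, and the names, images, and paths are made up:

```python
from torchx import specs

# Hedged sketch: each Role carries its own workspace instead of relying
# on a single workspace passed through the runner API.
app = specs.AppDef(
    name="rl_app",
    roles=[
        specs.Role(
            name="trainer",
            image="example/trainer:latest",
            entrypoint="train.py",
            workspace="/home/me/project/trainer",  # new per-role attribute
        ),
        specs.Role(
            name="generator",
            image="example/generator:latest",
            entrypoint="generate.py",
            workspace="/home/me/project/generator",
        ),
    ],
)
```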
For backwards compatibility I'm maintaining the following behavior (a minimal sketch of the resolution logic follows the list):

1. If `workspace` is specified as a runner argument then it takes precedence over `role[0].workspace`.
2. Non-zero roles (e.g. `role[1], role[2], ...`) are unaffected by the workspace argument. That is, their workspace attributes (e.g. `role[1].workspace`) are respected as-is.
3. "Disabling" the workspace (e.g. passing `workspace=None` as the runner argument) can still build a workspace if the role's workspace attribute is not `None`.
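Not TorchX's actual implementation, but a minimal Python sketch of how rules 1-3 resolve the effective workspace, assuming each role exposes an optional `workspace` attribute:

```python
from typing import Optional


def effective_workspace(
    role_idx: int,
    role_workspace: Optional[str],
    runner_workspace: Optional[str],
) -> Optional[str]:
    """Resolve which workspace a role gets under rules 1-3 above."""
    if role_idx == 0 and runner_workspace is not None:
        # Rule 1: an explicit runner argument overrides role[0].workspace.
        return runner_workspace
    # Rule 2: role[1], role[2], ... always keep their own attribute.
    # Rule 3: even with workspace=None at the runner, a non-None
    # role.workspace still yields a workspace build.
    return role_workspace
```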
NOTE: we need further optimization for the case where multiple roles have the same "image" and "workspace". In that case we only need to build the image+workspace once, but as it stands we end up building a separate ephemeral image per role (even if the ephemeral image is the SAME across all the roles). This isn't an issue in practice since image builders like Docker are content-addressed and cache layers.
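The optimization the NOTE describes could be sketched as follows. This is hypothetical, keyed on the (image, workspace) pair, and `build_workspace_image` is a stand-in rather than a real TorchX function:

```python
from typing import Dict, List, Optional, Tuple

BuildKey = Tuple[str, Optional[str]]  # (image, workspace)


def build_workspace_image(image: str, workspace: Optional[str]) -> str:
    """Stand-in for overlaying a workspace onto an image; returns an image id."""
    return f"{image}+{workspace}" if workspace else image


def build_once_per_pair(role_specs: List[BuildKey]) -> Dict[BuildKey, str]:
    # Build each distinct (image, workspace) pair exactly once and let
    # every role that shares the pair reuse the resulting image.
    built: Dict[BuildKey, str] = {}
    for image, workspace in role_specs:
        key: BuildKey = (image, workspace)
        if key not in built:
            built[key] = build_workspace_image(image, workspace)
    return built
```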
Reviewed By: AbishekS
Differential Revision: D83793199