14 changes: 9 additions & 5 deletions api/core/v1alpha1/model_types.go
@@ -101,16 +101,16 @@ type FlavorName string
 type Flavor struct {
     // Name represents the flavor name, which will be used in model claim.
     Name FlavorName `json:"name"`
-    // Requests defines the required accelerators to serve the model for each replica,
-    // like <nvidia.com/gpu: 8>. For multi-hosts cases, the requests here indicates
+    // Limits defines the required accelerators to serve the model for each replica,
+    // like <nvidia.com/gpu: 8>. For multi-host cases, the limits here indicate
     // the resource requirements for each replica, usually equals to the TP size.
     // Not recommended to set the cpu and memory usage here:
     // - if using playground, you can define the cpu/mem usage at backendConfig.
     // - if using inference service, you can define the cpu/mem at the container resources.
-    // However, if you define the same accelerator requests at playground/service as well,
-    // the requests will be overwritten by the flavor requests.
+    // However, if you define the same accelerator resources at playground/service as well,
+    // the resources will be overwritten by the flavor limits here.
     // +optional
-    Requests v1.ResourceList `json:"requests,omitempty"`
+    Limits v1.ResourceList `json:"limits,omitempty"`
     // NodeSelector represents the node candidates for Pod placements, if a node doesn't
     // meet the nodeSelector, it will be filtered out in the resourceFungibility scheduler plugin.
     // If nodeSelector is empty, it means every node is a candidate.
@@ -129,11 +129,15 @@ type Flavor struct {
 type InferenceConfig struct {
     // Flavors represents the accelerator requirements to serve the model.
     // Flavors are fungible following the priority represented by the slice order.
+    // This is used both in Playground and Inference Service.
+    // +kubebuilder:validation:MaxItems=8
     // +optional
     Flavors []Flavor `json:"flavors,omitempty"`
     // SharedMemorySize represents the size of /dev/shm required in the runtime of
     // inference workload.
+    // This is only used in Playground. Inference Service can configure the shared memory
+    // directly in PodSpec.
     // +optional
     SharedMemorySize *resource.Quantity `json:"sharedMemorySize,omitempty"`
 }

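To make the renamed field concrete, here is a minimal, self-contained Go sketch of declaring a flavor with `Limits`. The trimmed `Flavor` type below is a local copy for illustration only; the real definition lives in api/core/v1alpha1/model_types.go.

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
)

// Trimmed copy of the Flavor type from this diff, kept local so the
// snippet compiles on its own.
type FlavorName string

type Flavor struct {
	Name   FlavorName          `json:"name"`
	Limits corev1.ResourceList `json:"limits,omitempty"`
}

func main() {
	// Each replica asks for 8 GPUs, i.e. the TP size in the single-node case.
	flavor := Flavor{
		Name: "a100-80gb",
		Limits: corev1.ResourceList{
			"nvidia.com/gpu": resource.MustParse("8"),
		},
	}
	fmt.Printf("flavor %s limits: %v\n", flavor.Name, flavor.Limits)
}
```
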
9 changes: 7 additions & 2 deletions api/core/v1alpha1/zz_generated.deepcopy.go

Some generated files are not rendered by default.

10 changes: 5 additions & 5 deletions client-go/applyconfiguration/core/v1alpha1/flavor.go

Some generated files are not rendered by default.

15 changes: 14 additions & 1 deletion client-go/applyconfiguration/core/v1alpha1/inferenceconfig.go

Some generated files are not rendered by default.

37 changes: 20 additions & 17 deletions config/crd/bases/llmaz.io_openmodels.yaml
@@ -54,13 +54,31 @@ spec:
                 description: |-
                   Flavors represents the accelerator requirements to serve the model.
                   Flavors are fungible following the priority represented by the slice order.
+                  This is used both in Playground and Inference Service.
                 items:
                   description: |-
                     Flavor defines the accelerator requirements for a model and the necessary parameters
                     in autoscaling. Right now, it will be used in two places:
                     - Pod scheduling with node selectors specified.
                     - Cluster autoscaling with essential parameters provided.
                   properties:
+                    limits:
+                      additionalProperties:
+                        anyOf:
+                        - type: integer
+                        - type: string
+                        pattern: ^(\+|-)?(([0-9]+(\.[0-9]*)?)|(\.[0-9]+))(([KMGTPE]i)|[numkMGTPE]|([eE](\+|-)?(([0-9]+(\.[0-9]*)?)|(\.[0-9]+))))?$
+                        x-kubernetes-int-or-string: true
+                      description: |-
+                        Limits defines the required accelerators to serve the model for each replica,
+                        like <nvidia.com/gpu: 8>. For multi-host cases, the limits here indicate
+                        the resource requirements for each replica, usually equals to the TP size.
+                        Not recommended to set the cpu and memory usage here:
+                        - if using playground, you can define the cpu/mem usage at backendConfig.
+                        - if using inference service, you can define the cpu/mem at the container resources.
+                        However, if you define the same accelerator resources at playground/service as well,
+                        the resources will be overwritten by the flavor limits here.
+                      type: object
                     name:
                       description: Name represents the flavor name, which will
                         be used in model claim.
@@ -83,23 +101,6 @@ spec:
                         with <INSTANCE-TYPE: p4d.24xlarge> for AWS.
                         Preset parameters: TP, PP, INSTANCE-TYPE.
                       type: object
-                    requests:
-                      additionalProperties:
-                        anyOf:
-                        - type: integer
-                        - type: string
-                        pattern: ^(\+|-)?(([0-9]+(\.[0-9]*)?)|(\.[0-9]+))(([KMGTPE]i)|[numkMGTPE]|([eE](\+|-)?(([0-9]+(\.[0-9]*)?)|(\.[0-9]+))))?$
-                        x-kubernetes-int-or-string: true
-                      description: |-
-                        Requests defines the required accelerators to serve the model for each replica,
-                        like <nvidia.com/gpu: 8>. For multi-hosts cases, the requests here indicates
-                        the resource requirements for each replica, usually equals to the TP size.
-                        Not recommended to set the cpu and memory usage here:
-                        - if using playground, you can define the cpu/mem usage at backendConfig.
-                        - if using inference service, you can define the cpu/mem at the container resources.
-                        However, if you define the same accelerator requests at playground/service as well,
-                        the requests will be overwritten by the flavor requests.
-                      type: object
                     required:
                     - name
                     type: object
@@ -112,6 +113,8 @@ spec:
               description: |-
                 SharedMemorySize represents the size of /dev/shm required in the runtime of
                 inference workload.
+                This is only used in Playground. Inference Service can configure the shared memory
+                directly in PodSpec.
               pattern: ^(\+|-)?(([0-9]+(\.[0-9]*)?)|(\.[0-9]+))(([KMGTPE]i)|[numkMGTPE]|([eE](\+|-)?(([0-9]+(\.[0-9]*)?)|(\.[0-9]+))))?$
               x-kubernetes-int-or-string: true
             type: object
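
The new SharedMemorySize note points Inference Service users at the PodSpec instead. A hedged sketch of the usual Kubernetes pattern for that — a memory-backed emptyDir mounted at /dev/shm; the volume name, container name, and 16Gi value below are illustrative, not taken from this PR:

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
)

func main() {
	// Size /dev/shm with a memory-backed emptyDir volume.
	shm := resource.MustParse("16Gi")
	pod := corev1.PodSpec{
		Volumes: []corev1.Volume{{
			Name: "dshm",
			VolumeSource: corev1.VolumeSource{
				EmptyDir: &corev1.EmptyDirVolumeSource{
					Medium:    corev1.StorageMediumMemory,
					SizeLimit: &shm,
				},
			},
		}},
		Containers: []corev1.Container{{
			Name: "inference",
			VolumeMounts: []corev1.VolumeMount{{
				Name:      "dshm",
				MountPath: "/dev/shm",
			}},
		}},
	}
	fmt.Printf("volume %q mounted at %s\n",
		pod.Volumes[0].Name, pod.Containers[0].VolumeMounts[0].MountPath)
}
```
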
2 changes: 1 addition & 1 deletion docs/examples/hostpath/model.yaml
@@ -10,5 +10,5 @@ spec:
   inferenceConfig:
     flavors:
     - name: t4 # GPU type
-      requests:
+      limits:
         nvidia.com/gpu: 1
2 changes: 1 addition & 1 deletion docs/examples/huggingface/model.yaml
@@ -10,5 +10,5 @@ spec:
   inferenceConfig:
     flavors:
     - name: t4 # GPU type
-      requests:
+      limits:
         nvidia.com/gpu: 1
2 changes: 1 addition & 1 deletion docs/examples/modelscope/model.yaml
@@ -11,5 +11,5 @@ spec:
   inferenceConfig:
     flavors:
     - name: t4 # GPU type
-      requests:
+      limits:
         nvidia.com/gpu: 1
4 changes: 2 additions & 2 deletions docs/examples/multi-nodes/model.yaml
@@ -10,13 +10,13 @@ spec:
   inferenceConfig:
     flavors:
     - name: a100-80gb
-      requests:
+      limits:
         nvidia.com/gpu: 8 # single node request
       params:
         TP: "8" # 8 GPUs per node, equal to nvidia.com/gpu
         PP: "2" # 2 nodes
     # - name: h100
-    #   requests:
+    #   limits:
     #     nvidia.com/gpu: 8 # single node request
     #   params:
     #     TP: "8"
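
As a sanity check on the a100-80gb flavor above: the limits count GPUs per host, so with TP=8 and PP=2 each replica spans 2 hosts and 16 GPUs in total. A trivial sketch of that arithmetic:

```go
package main

import "fmt"

func main() {
	// From the multi-nodes example: TP = GPUs per host, PP = hosts per replica.
	tp, pp := 8, 2
	fmt.Printf("hosts per replica: %d, GPUs per replica: %d\n", pp, tp*pp)
}
```
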
2 changes: 1 addition & 1 deletion docs/examples/objstore-oss/model.yaml
@@ -11,5 +11,5 @@ spec:
   inferenceConfig:
     flavors:
     - name: t4 # GPU type
-      requests:
+      limits:
         nvidia.com/gpu: 1
2 changes: 1 addition & 1 deletion docs/examples/sglang/model.yaml
@@ -10,5 +10,5 @@ spec:
   inferenceConfig:
     flavors:
     - name: t4 # GPU type
-      requests:
+      limits:
         nvidia.com/gpu: 1
2 changes: 1 addition & 1 deletion docs/examples/speculative-decoding/vllm/model.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -10,7 +10,7 @@ spec:
inferenceConfig:
flavors:
- name: a10 # gpu type
requests:
limits:
nvidia.com/gpu: 1
---
apiVersion: llmaz.io/v1alpha1
2 changes: 1 addition & 1 deletion docs/examples/tgi/model.yaml
@@ -10,5 +10,5 @@ spec:
   inferenceConfig:
     flavors:
     - name: t4 # GPU type
-      requests:
+      limits:
         nvidia.com/gpu: 1
4 changes: 2 additions & 2 deletions pkg/controller/inference/service_controller.go
@@ -201,8 +201,8 @@ func injectModelFlavor(template *corev1.PodTemplateSpec, model *coreapi.OpenMode

     for i, flavor := range model.Spec.InferenceConfig.Flavors {
         if flavor.Name == flavorName {
-            requests := model.Spec.InferenceConfig.Flavors[i].Requests
-            for k, v := range requests {
+            limits := model.Spec.InferenceConfig.Flavors[i].Limits
+            for k, v := range limits {
                 if container.Resources.Requests == nil {
                     container.Resources.Requests = map[corev1.ResourceName]resource.Quantity{}
                 }
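
A simplified, standalone sketch of what this injection does: each accelerator quantity in the matched flavor's limits becomes a resource *request* on the serving container, creating the requests map on first use. The real injectModelFlavor also resolves the flavor by name and works on the pod template; only the copying loop is shown in the hunk above, so anything beyond it here is an assumption.

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
)

// injectFlavorLimits mirrors the loop above: flavor limits are written into
// the container's resource requests, allocating the map if it is nil.
func injectFlavorLimits(container *corev1.Container, limits corev1.ResourceList) {
	for k, v := range limits {
		if container.Resources.Requests == nil {
			container.Resources.Requests = corev1.ResourceList{}
		}
		container.Resources.Requests[k] = v
	}
}

func main() {
	c := corev1.Container{Name: "inference"}
	injectFlavorLimits(&c, corev1.ResourceList{"nvidia.com/gpu": resource.MustParse("8")})
	fmt.Println(c.Resources.Requests) // the GPU quantity now appears under Requests
}
```
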
3 changes: 3 additions & 0 deletions pkg/controller_helper/helper.go
@@ -121,6 +121,9 @@ func FirstAssignedFlavor(model *coreapi.OpenModel, playground *inferenceapi.Play
 // the second one is whether this is a multi-host inference.
 func MultiHostInference(model *coreapi.OpenModel, playground *inferenceapi.Playground) (int32, bool) {
     flavors := FirstAssignedFlavor(model, playground)
+    // This is not valid for all cases, e.g. SGLang uses TP for model parallelism.
+    // However, that is not a recommended way since TP requires more communication than PP.
+    // It's ok to support PP only at this moment.
     if len(flavors) > 0 && flavors[0].Params["PP"] != "" {
         size, err := strconv.Atoi(flavors[0].Params["PP"])
         if err != nil {
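
A standalone sketch of the PP-based check, assuming the flavor's params map is already in hand. The real MultiHostInference resolves the flavor via FirstAssignedFlavor first, and the single-host fallback on a parse error is an assumption here, since the error branch is cut off in the hunk above.

```go
package main

import (
	"fmt"
	"strconv"
)

// multiHostSize reads the "PP" param: its value is the number of hosts per
// replica, and anything greater than 1 means multi-host inference.
func multiHostSize(params map[string]string) (int32, bool) {
	if pp := params["PP"]; pp != "" {
		size, err := strconv.Atoi(pp)
		if err != nil {
			return 1, false // assumed fallback: treat a bad value as single host
		}
		return int32(size), size > 1
	}
	return 1, false
}

func main() {
	size, multi := multiHostSize(map[string]string{"TP": "8", "PP": "2"})
	fmt.Println(size, multi) // 2 true
}
```
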
4 changes: 2 additions & 2 deletions test/util/validation/validate_service.go
@@ -174,9 +174,9 @@ func ValidateModelFlavor(service *inferenceapi.Service, model *coreapi.OpenModel

     for _, flavor := range model.Spec.InferenceConfig.Flavors {
         if flavor.Name == flavorName {
-            requests := flavor.Requests
+            limits := flavor.Limits
             container := workload.Spec.LeaderWorkerTemplate.WorkerTemplate.Spec.Containers[0]
-            for k, v := range requests {
+            for k, v := range limits {
                 if !container.Resources.Requests[k].Equal(v) {
                     return fmt.Errorf("unexpected request value %v, got %v", v, workload.Spec.LeaderWorkerTemplate.WorkerTemplate.Spec.Containers[0].Resources.Requests[k])
                 }
6 changes: 3 additions & 3 deletions test/util/wrapper/model.go
@@ -133,10 +133,10 @@ func (w *FlavorWrapper) Obj() *coreapi.Flavor {
 }

 func (w *FlavorWrapper) SetRequest(r, v string) *FlavorWrapper {
-    if w.Requests == nil {
-        w.Requests = map[v1.ResourceName]resource.Quantity{}
+    if w.Limits == nil {
+        w.Limits = map[v1.ResourceName]resource.Quantity{}
     }
-    w.Requests[v1.ResourceName(r)] = resource.MustParse(v)
+    w.Limits[v1.ResourceName(r)] = resource.MustParse(v)
     return w
 }
