2 changes: 1 addition & 1 deletion README.md
@@ -27,7 +27,7 @@ Easy, advanced inference platform for large language models on Kubernetes
## Features Overview

- **Ease of Use**: People can quickly deploy an LLM service with minimal configuration.
- **Broad Backend Support**: llmaz supports a wide range of advanced inference backends for different scenarios, like [vLLM](https://github.com/vllm-project/vllm), [Text-Generation-Inference](https://github.com/huggingface/text-generation-inference), [SGLang](https://github.com/sgl-project/sglang), [llama.cpp](https://github.com/ggerganov/llama.cpp). Find the full list of supported backends [here](./docs/support-backends.md).
- **Broad Backend Support**: llmaz supports a wide range of advanced inference backends for different scenarios, like [vLLM](https://github.com/vllm-project/vllm), [Text-Generation-Inference](https://github.com/huggingface/text-generation-inference), [SGLang](https://github.com/sgl-project/sglang), [llama.cpp](https://github.com/ggerganov/llama.cpp), [ollama](https://github.com/ollama/ollama). Find the full list of supported backends [here](./docs/support-backends.md).
- **Scaling Efficiency (WIP)**: llmaz works smoothly with autoscaling components like [Cluster-Autoscaler](https://github.com/kubernetes/autoscaler/tree/master/cluster-autoscaler) or [Karpenter](https://github.com/kubernetes-sigs/karpenter) to support elastic scenarios.
- **Accelerator Fungibility (WIP)**: llmaz supports serving the same LLM with various accelerators to optimize cost and performance.
- **SOTA Inference**: llmaz supports the latest cutting-edge researches like [Speculative Decoding](https://arxiv.org/abs/2211.17192) or [Splitwise](https://arxiv.org/abs/2311.18677)(WIP) to run on Kubernetes.
36 changes: 36 additions & 0 deletions chart/templates/backends/ollama.yaml
@@ -0,0 +1,36 @@
{{- if .Values.backendRuntime.install -}}
apiVersion: inference.llmaz.io/v1alpha1
kind: BackendRuntime
metadata:
  labels:
    app.kubernetes.io/name: backendruntime
    app.kubernetes.io/part-of: llmaz
    app.kubernetes.io/created-by: llmaz
  name: ollama
spec:
  commands:
    - sh
    - -c
  image: ollama/ollama
  version: latest
  # Do not edit the preset argument name unless you know what you're doing.
  # Feel free to add more arguments to fit your requirements.
  args:
    - name: default
      flags:
- "ollama serve &
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to support readiness and liveness next, see #21, but this is ok for now.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's support probe in the future

while true; do output=$(ollama list 2>&1);
if ! echo $output | grep -q 'could not connect to ollama app' && echo $output | grep -q 'NAME';then echo 'ollama is running';break; else echo 'Waiting for the ollama to be running...';sleep 1;fi;done;
ollama run {{`{{ .ModelName }}`}};
while true;do sleep 60;done"
  envs:
    - name: OLLAMA_HOST
      value: 0.0.0.0:8080

Review comment (Member Author @qinguoyi, Nov 3, 2024): OLLAMA_HOST can expose custom port
Reply (Member): great
  resources:
    requests:
      cpu: 2
      memory: 4Gi
    limits:
      cpu: 2
      memory: 4Gi
{{- end }}
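The flags above are templated twice: Helm renders {{`{{ .ModelName }}`}} into the literal {{ .ModelName }}, and the llmaz controller later fills that placeholder with the model name resolved from the OpenModel URI. Below is a minimal Go sketch of the two passes; the controller actually performs the second substitution with a regexp (see renderFlags further down), and qwen2:0.5b is only an illustrative value.

package main

import (
	"os"
	"text/template"
)

func main() {
	// Pass 1 (Helm): the backtick-quoted raw string is emitted verbatim, so the
	// rendered BackendRuntime keeps a literal "{{ .ModelName }}" placeholder.
	helmPass := template.Must(template.New("helm").Parse("ollama run {{`{{ .ModelName }}`}}\n"))
	_ = helmPass.Execute(os.Stdout, nil) // ollama run {{ .ModelName }}

	// Pass 2 (llmaz controller, sketched with text/template for brevity): the
	// placeholder is filled with the model name parsed from ollama://qwen2:0.5b.
	controllerPass := template.Must(template.New("llmaz").Parse("ollama run {{ .ModelName }}\n"))
	_ = controllerPass.Execute(os.Stdout, map[string]string{"ModelName": "qwen2:0.5b"}) // ollama run qwen2:0.5b
}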
7 changes: 6 additions & 1 deletion docs/examples/README.md
@@ -10,6 +10,7 @@ We provide a set of examples to help you serve large language models, by default
- [Deploy models via SGLang](#deploy-models-via-sglang)
- [Deploy models via llama.cpp](#deploy-models-via-llamacpp)
- [Deploy models via text-generation-inference](#deploy-models-via-tgi)
- [Deploy models via ollama](#deploy-models-via-ollama)
- [Speculative Decoding with vLLM](#speculative-decoding-with-vllm)

### Deploy models from Huggingface
@@ -32,7 +33,7 @@ In theory, if we want to load the `Qwen2-7B` model, which occupies about 14.2 GB

- Alibaba Cloud OSS, see [example](./objstore-oss/) here

> Note: you should set OSS_ACCESS_KEY_ID and OSS_ACCESS_kEY_SECRET first by running `kubectl create secret generic oss-access-secret --from-literal=OSS_ACCESS_KEY_ID=<your ID> --from-literal=OSS_ACCESS_kEY_SECRET=<your secret>`
> Note: you should set OSS_ACCESS_KEY_ID and OSS_ACCESS_kEY_SECRET first by running `kubectl create secret generic oss-access-secret --from-literal=OSS_ACCESS_KEY_ID=<your ID> --from-literal=OSS_ACCESS_kEY_SECRET=<your secret>`

### Deploy models via SGLang

@@ -46,6 +47,10 @@ By default, we use [vLLM](https://github.com/vllm-project/vllm) as the inference

[text-generation-inference](https://github.com/huggingface/text-generation-inference) is used in production at Hugging Face to power Hugging Chat, the Inference API and Inference Endpoint. see [example](./tgi/) here.

### Deploy models via ollama

[ollama](https://github.com/ollama/ollama) is built on top of llama.cpp and is aimed at local deployment. See [example](./ollama/) here.

### Speculative Decoding with vLLM

[Speculative Decoding](https://arxiv.org/abs/2211.17192) can improve inference performance efficiently, see [example](./speculative-decoding/vllm/) here.
8 changes: 8 additions & 0 deletions docs/examples/ollama/model.yaml
@@ -0,0 +1,8 @@
apiVersion: llmaz.io/v1alpha1
kind: OpenModel
metadata:
  name: qwen2-0--5b
spec:
  familyName: qwen2
  source:
    uri: ollama://qwen2:0.5b
10 changes: 10 additions & 0 deletions docs/examples/ollama/playground.yaml
@@ -0,0 +1,10 @@
apiVersion: inference.llmaz.io/v1alpha1
kind: Playground
metadata:
  name: qwen2-0--5b
spec:
  replicas: 1
  modelClaim:
    modelName: qwen2-0--5b
  backendRuntimeConfig:
    name: ollama
4 changes: 4 additions & 0 deletions docs/support-backends.md
@@ -14,6 +14,10 @@ If you want to integrate more backends into llmaz, please refer to this [PR](htt

[text-generation-inference](https://github.com/huggingface/text-generation-inference) is a Rust, Python and gRPC server for text generation inference. Used in production at Hugging Face to power Hugging Chat, the Inference API and Inference Endpoint.

## ollama

[ollama](https://github.com/ollama/ollama) runs Llama 3.2, Mistral, Gemma 2, and other large language models locally. It is built on top of llama.cpp and is aimed at local deployment.

## vLLM

[vLLM](https://github.com/vllm-project/vllm) is a high-throughput and memory-efficient inference and serving engine for LLMs
20 changes: 12 additions & 8 deletions pkg/controller_helper/backendruntime.go
@@ -19,6 +19,7 @@ package helper
import (
"fmt"
"regexp"
"strings"

corev1 "k8s.io/api/core/v1"

@@ -94,19 +95,22 @@ func (p *BackendRuntimeParser) Resources() inferenceapi.ResourceRequirements {
func renderFlags(flags []string, modelInfo map[string]string) ([]string, error) {
// Capture the word.
re := regexp.MustCompile(`\{\{\s*\.(\w+)\s*\}\}`)

res := []string{}
var value string

for _, flag := range flags {
value = flag
match := re.FindStringSubmatch(flag)
if len(match) > 1 {
// Return the matched word.
value = modelInfo[match[1]]

if value == "" {
value := flag
matches := re.FindAllStringSubmatch(flag, -1)
for _, match := range matches {
if len(match) <= 1 {
continue
}
key := match[1]
replacement, exists := modelInfo[key]
if !exists {
return nil, fmt.Errorf("missing flag or the flag has format error: %s", flag)
}
value = strings.Replace(value, match[0], replacement, -1)
}

res = append(res, value)
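The rewritten renderFlags is what makes the long composite ollama flag work: the old code matched only the first placeholder and replaced the whole flag with the looked-up value, while the new code substitutes every {{ .Key }} occurrence in place and returns an error for unknown keys. Here is a self-contained sketch of the same logic with an illustrative call; the model name qwen2:0.5b is just an example.

package main

import (
	"fmt"
	"regexp"
	"strings"
)

// renderFlagsSketch mirrors the updated renderFlags: every {{ .Key }} occurrence
// in a flag is replaced from modelInfo, and a missing key is reported as an error
// instead of silently producing an empty value.
func renderFlagsSketch(flags []string, modelInfo map[string]string) ([]string, error) {
	re := regexp.MustCompile(`\{\{\s*\.(\w+)\s*\}\}`)
	res := []string{}
	for _, flag := range flags {
		value := flag
		for _, match := range re.FindAllStringSubmatch(flag, -1) {
			replacement, ok := modelInfo[match[1]]
			if !ok {
				return nil, fmt.Errorf("missing flag or the flag has format error: %s", flag)
			}
			value = strings.Replace(value, match[0], replacement, -1)
		}
		res = append(res, value)
	}
	return res, nil
}

func main() {
	out, err := renderFlagsSketch(
		[]string{"ollama run {{ .ModelName }};sleep 5", "--host", "0.0.0.0"},
		map[string]string{"ModelName": "qwen2:0.5b"},
	)
	fmt.Println(out, err) // [ollama run qwen2:0.5b;sleep 5 --host 0.0.0.0] <nil>
}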
9 changes: 9 additions & 0 deletions pkg/controller_helper/backendruntime_test.go
@@ -30,6 +30,15 @@ func TestRenderFlags(t *testing.T) {
wantFlags []string
wantError bool
}{
{
name: "normal parse long args",
flags: []string{"run {{ .ModelPath }};sleep 5", "--host", "0.0.0.0"},
modelInfo: map[string]string{
"ModelPath": "path/to/model",
"ModelName": "foo",
},
wantFlags: []string{"run path/to/model;sleep 5", "--host", "0.0.0.0"},
},
{
name: "normal parse",
flags: []string{"-m", "{{ .ModelPath }}", "--served-model-name", "{{ .ModelName }}", "--host", "0.0.0.0"},
3 changes: 2 additions & 1 deletion pkg/controller_helper/model_source/modelsource.go
@@ -72,11 +72,12 @@ func NewModelSourceProvider(model *coreapi.OpenModel) ModelSourceProvider {
if model.Spec.Source.URI != nil {
// We'll validate the format in the webhook, so generally no error should happen here.
protocol, address, _ := util.ParseURI(string(*model.Spec.Source.URI))
provider := &URIProvider{modelName: model.Name, protocol: protocol}
provider := &URIProvider{modelName: model.Name, protocol: protocol, modelAddress: address}

switch protocol {
case OSS:
provider.endpoint, provider.bucket, provider.modelPath, _ = util.ParseOSS(address)
case OLLAMA:
default:
// This should be validated at webhooks.
panic("protocol not supported")
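NewModelSourceProvider now stores the parsed address on the provider, so the OLLAMA branch, which unlike OSS has no endpoint or bucket to split out, simply carries the ollama tag through. The following is a hypothetical stand-in for util.ParseURI to show the shape of the data; the real helper is not shown in this diff and may differ.

package main

import (
	"fmt"
	"strings"
)

// parseURISketch is a hypothetical stand-in for util.ParseURI as it is used above:
// it splits "ollama://qwen2:0.5b" into a protocol ("OLLAMA") and an address ("qwen2:0.5b").
func parseURISketch(uri string) (protocol, address string, err error) {
	scheme, rest, found := strings.Cut(uri, "://")
	if !found {
		return "", "", fmt.Errorf("invalid model URI: %s", uri)
	}
	return strings.ToUpper(scheme), rest, nil
}

func main() {
	protocol, address, _ := parseURISketch("ollama://qwen2:0.5b")
	fmt.Println(protocol, address) // OLLAMA qwen2:0.5b
	// The OLLAMA case keeps the address as modelAddress on the URIProvider and,
	// unlike OSS, needs no endpoint/bucket/modelPath parsing.
}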
20 changes: 14 additions & 6 deletions pkg/controller_helper/model_source/uri.go
@@ -26,18 +26,23 @@ import (
var _ ModelSourceProvider = &URIProvider{}

const (
OSS = "OSS"
OSS = "OSS"
OLLAMA = "OLLAMA"
)

type URIProvider struct {
modelName string
protocol string
bucket string
endpoint string
modelPath string
modelName    string
protocol     string
bucket       string
endpoint     string
modelPath    string
modelAddress string
}

Review comment (Member): We may need to refactor this part in the future, divide the URIProvider to more specified ones, but not a hurry.

func (p *URIProvider) ModelName() string {
if p.protocol == OLLAMA {
return p.modelAddress
}
return p.modelName
}

@@ -58,6 +63,9 @@ func (p *URIProvider) ModelPath() string {
}

func (p *URIProvider) InjectModelLoader(template *corev1.PodTemplateSpec, index int) {
if p.protocol == OLLAMA {
return
}
initContainerName := MODEL_LOADER_CONTAINER_NAME
if index != 0 {
initContainerName += "-" + strconv.Itoa(index)
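For the OLLAMA protocol the provider behaves differently in two ways: ModelName() returns the pullable tag (qwen2:0.5b) rather than the OpenModel object name (qwen2-0--5b), and InjectModelLoader becomes a no-op because ollama run pulls the model itself instead of reading it from a model-loader volume. A condensed sketch of that behavior follows; field and constant names are taken from the diff.

package main

import "fmt"

const ollamaProtocol = "OLLAMA"

// uriProviderSketch condenses the ollama-relevant parts of URIProvider.
type uriProviderSketch struct {
	modelName    string // OpenModel object name, e.g. "qwen2-0--5b"
	protocol     string
	modelAddress string // address parsed from the URI, e.g. "qwen2:0.5b"
}

func (p *uriProviderSketch) ModelName() string {
	// The rendered flag "ollama run {{ .ModelName }}" needs the ollama tag,
	// not the Kubernetes resource name.
	if p.protocol == ollamaProtocol {
		return p.modelAddress
	}
	return p.modelName
}

func main() {
	p := &uriProviderSketch{modelName: "qwen2-0--5b", protocol: ollamaProtocol, modelAddress: "qwen2:0.5b"}
	fmt.Println(p.ModelName()) // qwen2:0.5b
}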
3 changes: 2 additions & 1 deletion pkg/webhook/openmodel_webhook.go
@@ -47,7 +47,8 @@ func SetupOpenModelWebhook(mgr ctrl.Manager) error {
var _ webhook.CustomDefaulter = &OpenModelWebhook{}

var SUPPORTED_OBJ_STORES = map[string]struct{}{
modelSource.OSS: {},
modelSource.OSS: {},
modelSource.OLLAMA: {},
}

// Default implements webhook.Defaulter so a webhook will be registered for the type
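With modelSource.OLLAMA added to SUPPORTED_OBJ_STORES, a uri of the form ollama://qwen2:0.5b passes the webhook's protocol check. The validation code itself is not part of this hunk, so the following is only a guess at its shape, keyed on the same map.

package main

import (
	"fmt"
	"strings"
)

// supportedObjStores mirrors SUPPORTED_OBJ_STORES from the diff.
var supportedObjStores = map[string]struct{}{
	"OSS":    {},
	"OLLAMA": {},
}

// validateModelURI is a hypothetical sketch of the protocol check the webhook
// presumably performs on spec.source.uri.
func validateModelURI(uri string) error {
	scheme, _, found := strings.Cut(uri, "://")
	if !found {
		return fmt.Errorf("malformed model URI: %s", uri)
	}
	if _, ok := supportedObjStores[strings.ToUpper(scheme)]; !ok {
		return fmt.Errorf("unsupported protocol %q in model URI %q", scheme, uri)
	}
	return nil
}

func main() {
	fmt.Println(validateModelURI("ollama://qwen2:0.5b")) // <nil>
	fmt.Println(validateModelURI("ftp://foo/bar"))       // unsupported protocol "ftp" ...
}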
29 changes: 29 additions & 0 deletions test/config/backends/ollama.yaml
@@ -0,0 +1,29 @@
apiVersion: inference.llmaz.io/v1alpha1
kind: BackendRuntime
metadata:
  labels:
    app.kubernetes.io/name: backendruntime
    app.kubernetes.io/part-of: llmaz
    app.kubernetes.io/created-by: llmaz
  name: ollama
spec:
  commands:
    - sh
    - -c
  image: ollama/ollama
  version: latest
  args:
    - name: default
      flags:
        - "ollama serve &
          while true; do output=$(ollama list 2>&1);
          if ! echo $output | grep -q 'could not connect to ollama app' && echo $output | grep -q 'NAME';then echo 'ollama is running';break; else echo 'Waiting for the ollama to be running...';sleep 1;fi;done;
          ollama run {{`{{ .ModelName }}`}};
          while true;do sleep 60;done"
  resources:
    requests:
      cpu: 2
      memory: 4Gi
    limits:
      cpu: 2
      memory: 4Gi
19 changes: 19 additions & 0 deletions test/e2e/playground_test.go
@@ -47,6 +47,25 @@ var _ = ginkgo.Describe("playground e2e tests", func() {
gomega.Expect(testing.DeleteNamespace(ctx, k8sClient, ns)).To(gomega.Succeed())
})

ginkgo.It("Deploy a ollama model with ollama", func() {
backendRuntime := wrapper.MakeBackendRuntime("llmaz-ollama").
Image("ollama/ollama").Version("latest").
Command([]string{"sh", "-c"}).
Arg("default", []string{"ollama serve & while true;do output=$(ollama list 2>&1);if ! echo $output | grep -q 'could not connect to ollama app' && echo $output | grep -q 'NAME';then echo 'ollama is running';break; else echo 'Waiting for the ollama to be running...';sleep 1;fi;done;ollama run {{.ModelName}};while true;do sleep 60;done"}).
Request("cpu", "2").Request("memory", "4Gi").Limit("cpu", "4").Limit("memory", "4Gi").Obj()
gomega.Expect(k8sClient.Create(ctx, backendRuntime)).To(gomega.Succeed())

model := wrapper.MakeModel("qwen2-0--5b").FamilyName("qwen2").ModelSourceWithURI("ollama://qwen2:0.5b").Obj()
gomega.Expect(k8sClient.Create(ctx, model)).To(gomega.Succeed())
defer func() {
gomega.Expect(k8sClient.Delete(ctx, model)).To(gomega.Succeed())
}()
playground := wrapper.MakePlayground("qwen2-0--5b", ns.Name).ModelClaim("qwen2-0--5b").BackendRuntime("llmaz-ollama").Replicas(1).Obj()
gomega.Expect(k8sClient.Create(ctx, playground)).To(gomega.Succeed())
validation.ValidatePlayground(ctx, k8sClient, playground)
validation.ValidatePlaygroundStatusEqualTo(ctx, k8sClient, playground, inferenceapi.PlaygroundAvailable, "PlaygroundReady", metav1.ConditionTrue)

})
ginkgo.It("Deploy a huggingface model with llama.cpp", func() {
model := wrapper.MakeModel("qwen2-0-5b-gguf").FamilyName("qwen2").ModelSourceWithModelHub("Huggingface").ModelSourceWithModelID("Qwen/Qwen2-0.5B-Instruct-GGUF", "qwen2-0_5b-instruct-q5_k_m.gguf", "", nil, nil).Obj()
gomega.Expect(k8sClient.Create(ctx, model)).To(gomega.Succeed())
28 changes: 28 additions & 0 deletions test/integration/controller/inference/playground_test.go
@@ -264,6 +264,34 @@ var _ = ginkgo.Describe("playground controller test", func() {
},
},
}),
ginkgo.Entry("advance configured Playground with ollama", &testValidatingCase{
makePlayground: func() *inferenceapi.Playground {
return wrapper.MakePlayground("playground", ns.Name).ModelClaim(model.Name).Label(coreapi.ModelNameLabelKey, model.Name).
BackendRuntime("ollama").BackendRuntimeVersion("main").BackendRuntimeArgs([]string{"--foo", "bar"}).BackendRuntimeEnv("FOO", "BAR").
BackendRuntimeRequest("cpu", "1").BackendRuntimeLimit("cpu", "10").
Obj()
},
updates: []*update{
{
updateFunc: func(playground *inferenceapi.Playground) {
gomega.Expect(k8sClient.Create(ctx, playground)).To(gomega.Succeed())
},
checkFunc: func(ctx context.Context, k8sClient client.Client, playground *inferenceapi.Playground) {
validation.ValidatePlayground(ctx, k8sClient, playground)
validation.ValidatePlaygroundStatusEqualTo(ctx, k8sClient, playground, inferenceapi.PlaygroundProgressing, "Pending", metav1.ConditionTrue)
},
},
{
updateFunc: func(playground *inferenceapi.Playground) {
util.UpdateLwsToReady(ctx, k8sClient, playground.Name, playground.Namespace)
},
checkFunc: func(ctx context.Context, k8sClient client.Client, playground *inferenceapi.Playground) {
validation.ValidatePlayground(ctx, k8sClient, playground)
validation.ValidatePlaygroundStatusEqualTo(ctx, k8sClient, playground, inferenceapi.PlaygroundAvailable, "PlaygroundReady", metav1.ConditionTrue)
},
},
},
}),
ginkgo.Entry("playground is created when service exists with the same name", &testValidatingCase{
makePlayground: func() *inferenceapi.Playground {
return util.MockASamplePlayground(ns.Name)