
Commit f0483b1

Support ollama
1 parent 828ac74 commit f0483b1

14 files changed: +180 -18 lines


README.md

Lines changed: 1 addition & 1 deletion

@@ -27,7 +27,7 @@ Easy, advanced inference platform for large language models on Kubernetes
 ## Features Overview

 - **Easy of Use**: People can quick deploy a LLM service with minimal configurations.
-- **Broad Backend Support**: llmaz supports a wide range of advanced inference backends for different scenarios, like [vLLM](https://github.com/vllm-project/vllm), [Text-Generation-Inference](https://github.com/huggingface/text-generation-inference), [SGLang](https://github.com/sgl-project/sglang), [llama.cpp](https://github.com/ggerganov/llama.cpp). Find the full list of supported backends [here](./docs/support-backends.md).
+- **Broad Backend Support**: llmaz supports a wide range of advanced inference backends for different scenarios, like [vLLM](https://github.com/vllm-project/vllm), [Text-Generation-Inference](https://github.com/huggingface/text-generation-inference), [SGLang](https://github.com/sgl-project/sglang), [llama.cpp](https://github.com/ggerganov/llama.cpp), [ollama](https://github.com/ollama/ollama). Find the full list of supported backends [here](./docs/support-backends.md).
 - **Scaling Efficiency (WIP)**: llmaz works smoothly with autoscaling components like [Cluster-Autoscaler](https://github.com/kubernetes/autoscaler/tree/master/cluster-autoscaler) or [Karpenter](https://github.com/kubernetes-sigs/karpenter) to support elastic scenarios.
 - **Accelerator Fungibility (WIP)**: llmaz supports serving the same LLM with various accelerators to optimize cost and performance.
 - **SOTA Inference**: llmaz supports the latest cutting-edge researches like [Speculative Decoding](https://arxiv.org/abs/2211.17192) or [Splitwise](https://arxiv.org/abs/2311.18677)(WIP) to run on Kubernetes.

Lines changed: 36 additions & 0 deletions

@@ -0,0 +1,36 @@
+{{- if .Values.backendRuntime.install -}}
+apiVersion: inference.llmaz.io/v1alpha1
+kind: BackendRuntime
+metadata:
+  labels:
+    app.kubernetes.io/name: backendruntime
+    app.kubernetes.io/part-of: llmaz
+    app.kubernetes.io/created-by: llmaz
+  name: ollama
+spec:
+  commands:
+    - sh
+    - -c
+  image: ollama/ollama
+  version: latest
+  # Do not edit the preset argument name unless you know what you're doing.
+  # Free to add more arguments with your requirements.
+  args:
+    - name: default
+      flags:
+        - "ollama serve &
+          while true; do output=$(ollama list 2>&1);
+          if ! echo $output | grep -q 'could not connect to ollama app' && echo $output | grep -q 'NAME';then echo 'ollama is running';break; else echo 'Waiting for the ollama to be running...';sleep 1;fi;done;
+          ollama run {{`{{ .ModelName }}`}};
+          while true;do sleep 60;done"
+  envs:
+    - name: OLLAMA_HOST
+      value: 0.0.0.0:8080
+  resources:
+    requests:
+      cpu: 2
+      memory: 4Gi
+    limits:
+      cpu: 2
+      memory: 4Gi
+{{- end }}

docs/examples/README.md

Lines changed: 6 additions & 1 deletion

@@ -10,6 +10,7 @@ We provide a set of examples to help you serve large language models, by default
 - [Deploy models via SGLang](#deploy-models-via-sglang)
 - [Deploy models via llama.cpp](#deploy-models-via-llamacpp)
 - [Deploy models via text-generation-inference](#deploy-models-via-tgi)
+- [Deploy models via ollama](#deploy-models-via-ollama)
 - [Speculative Decoding with vLLM](#speculative-decoding-with-vllm)

 ### Deploy models from Huggingface
@@ -32,7 +33,7 @@ In theory, if we want to load the `Qwen2-7B` model, which occupies about 14.2 GB
 - Alibaba Cloud OSS, see [example](./objstore-oss/) here

-> Note: you should set OSS_ACCESS_KEY_ID and OSS_ACCESS_kEY_SECRET first by running `kubectl create secret generic oss-access-secret --from-literal=OSS_ACCESS_KEY_ID=<your ID> --from-literal=OSS_ACCESS_kEY_SECRET=<your secret>`
+> Note: you should set OSS_ACCESS_KEY_ID and OSS_ACCESS_kEY_SECRET first by running `kubectl create secret generic oss-access-secret --from-literal=OSS_ACCESS_KEY_ID=<your ID> --from-literal=OSS_ACCESS_kEY_SECRET=<your secret>`

 ### Deploy models via SGLang

@@ -46,6 +47,10 @@ By default, we use [vLLM](https://github.com/vllm-project/vllm) as the inference

 [text-generation-inference](https://github.com/huggingface/text-generation-inference) is used in production at Hugging Face to power Hugging Chat, the Inference API and Inference Endpoint. see [example](./tgi/) here.

+### Deploy models via ollama
+
+[ollama](https://github.com/ollama/ollama), built on top of llama.cpp, aims at local deployment. See the [example](./ollama/) here.
+
 ### Speculative Decoding with vLLM

 [Speculative Decoding](https://arxiv.org/abs/2211.17192) can improve inference performance efficiently, see [example](./speculative-decoding/vllm/) here.

docs/examples/ollama/model.yaml

Lines changed: 8 additions & 0 deletions

@@ -0,0 +1,8 @@
+apiVersion: llmaz.io/v1alpha1
+kind: OpenModel
+metadata:
+  name: qwen2-0--5b
+spec:
+  familyName: qwen2
+  source:
+    uri: ollama://qwen2:0.5b

Lines changed: 10 additions & 0 deletions

@@ -0,0 +1,10 @@
+apiVersion: inference.llmaz.io/v1alpha1
+kind: Playground
+metadata:
+  name: qwen2-0--5b
+spec:
+  replicas: 1
+  modelClaim:
+    modelName: qwen2-0--5b
+  backendRuntimeConfig:
+    name: ollama

docs/support-backends.md

Lines changed: 4 additions & 0 deletions

@@ -14,6 +14,10 @@ If you want to integrate more backends into llmaz, please refer to this [PR](htt

 [text-generation-inference](https://github.com/huggingface/text-generation-inference) is a Rust, Python and gRPC server for text generation inference. Used in production at Hugging Face to power Hugging Chat, the Inference API and Inference Endpoint.

+## ollama
+
+[ollama](https://github.com/ollama/ollama) runs Llama 3.2, Mistral, Gemma 2, and other large language models. It is built on top of llama.cpp and aims at local deployment.
+
 ## vLLM

 [vLLM](https://github.com/vllm-project/vllm) is a high-throughput and memory-efficient inference and serving engine for LLMs

pkg/controller_helper/backendruntime.go

Lines changed: 12 additions & 8 deletions

@@ -19,6 +19,7 @@ package helper
 import (
     "fmt"
     "regexp"
+    "strings"

     corev1 "k8s.io/api/core/v1"

@@ -94,19 +95,22 @@ func (p *BackendRuntimeParser) Resources() inferenceapi.ResourceRequirements {
 func renderFlags(flags []string, modelInfo map[string]string) ([]string, error) {
     // Capture the word.
     re := regexp.MustCompile(`\{\{\s*\.(\w+)\s*\}\}`)
+
     res := []string{}
-    var value string

     for _, flag := range flags {
-        value = flag
-        match := re.FindStringSubmatch(flag)
-        if len(match) > 1 {
-            // Return the matched word.
-            value = modelInfo[match[1]]
-
-            if value == "" {
+        value := flag
+        matches := re.FindAllStringSubmatch(flag, -1)
+        for _, match := range matches {
+            if len(match) <= 1 {
+                continue
+            }
+            key := match[1]
+            replacement, exists := modelInfo[key]
+            if !exists {
                 return nil, fmt.Errorf("missing flag or the flag has format error: %s", flag)
             }
+            value = strings.Replace(value, match[0], replacement, -1)
         }

         res = append(res, value)
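
The rewritten renderFlags above now substitutes every `{{ .Key }}` placeholder found anywhere inside a flag, instead of only handling a flag that is a bare placeholder; that is what lets the long ollama shell one-liner in the BackendRuntime template reference the model name. The following standalone sketch mirrors that substitution logic; renderFlagsSketch and the demo main are illustrative names, not part of this commit.

package main

import (
    "fmt"
    "regexp"
    "strings"
)

// renderFlagsSketch replaces every {{ .Key }} occurrence in each flag with the
// matching value from modelInfo and errors out on unknown keys, mirroring the
// substitution behavior of the updated renderFlags.
func renderFlagsSketch(flags []string, modelInfo map[string]string) ([]string, error) {
    re := regexp.MustCompile(`\{\{\s*\.(\w+)\s*\}\}`)
    res := []string{}
    for _, flag := range flags {
        value := flag
        for _, match := range re.FindAllStringSubmatch(flag, -1) {
            replacement, ok := modelInfo[match[1]]
            if !ok {
                return nil, fmt.Errorf("missing flag or the flag has format error: %s", flag)
            }
            value = strings.Replace(value, match[0], replacement, -1)
        }
        res = append(res, value)
    }
    return res, nil
}

func main() {
    // A placeholder embedded in a longer shell-style flag, as the ollama runtime uses.
    flags := []string{"ollama run {{ .ModelName }}; while true; do sleep 60; done", "--host", "0.0.0.0"}
    rendered, _ := renderFlagsSketch(flags, map[string]string{"ModelName": "qwen2:0.5b"})
    fmt.Println(rendered) // [ollama run qwen2:0.5b; while true; do sleep 60; done --host 0.0.0.0]
}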

pkg/controller_helper/backendruntime_test.go

Lines changed: 9 additions & 0 deletions

@@ -30,6 +30,15 @@ func TestRenderFlags(t *testing.T) {
         wantFlags []string
         wantError bool
     }{
+        {
+            name:  "normal parse long args",
+            flags: []string{"run {{ .ModelPath }};sleep 5", "--host", "0.0.0.0"},
+            modelInfo: map[string]string{
+                "ModelPath": "path/to/model",
+                "ModelName": "foo",
+            },
+            wantFlags: []string{"run path/to/model;sleep 5", "--host", "0.0.0.0"},
+        },
         {
             name:  "normal parse",
             flags: []string{"-m", "{{ .ModelPath }}", "--served-model-name", "{{ .ModelName }}", "--host", "0.0.0.0"},

pkg/controller_helper/model_source/modelsource.go

Lines changed: 2 additions & 1 deletion

@@ -72,11 +72,12 @@ func NewModelSourceProvider(model *coreapi.OpenModel) ModelSourceProvider {
     if model.Spec.Source.URI != nil {
         // We'll validate the format in the webhook, so generally no error should happen here.
         protocol, address, _ := util.ParseURI(string(*model.Spec.Source.URI))
-        provider := &URIProvider{modelName: model.Name, protocol: protocol}
+        provider := &URIProvider{modelName: model.Name, protocol: protocol, modelAddress: address}

         switch protocol {
         case OSS:
             provider.endpoint, provider.bucket, provider.modelPath, _ = util.ParseOSS(address)
+        case OLLAMA:
         default:
             // This should be validated at webhooks.
             panic("protocol not supported")

pkg/controller_helper/model_source/uri.go

Lines changed: 14 additions & 6 deletions

@@ -26,18 +26,23 @@ import (
 var _ ModelSourceProvider = &URIProvider{}

 const (
-    OSS = "OSS"
+    OSS    = "OSS"
+    OLLAMA = "OLLAMA"
 )

 type URIProvider struct {
-    modelName string
-    protocol  string
-    bucket    string
-    endpoint  string
-    modelPath string
+    modelName    string
+    protocol     string
+    bucket       string
+    endpoint     string
+    modelPath    string
+    modelAddress string
 }

 func (p *URIProvider) ModelName() string {
+    if p.protocol == OLLAMA {
+        return p.modelAddress
+    }
     return p.modelName
 }

@@ -58,6 +63,9 @@ func (p *URIProvider) ModelPath() string {
 }

 func (p *URIProvider) InjectModelLoader(template *corev1.PodTemplateSpec, index int) {
+    if p.protocol == OLLAMA {
+        return
+    }
     initContainerName := MODEL_LOADER_CONTAINER_NAME
     if index != 0 {
         initContainerName += "-" + strconv.Itoa(index)
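
Taken together, the two OLLAMA branches mean that for an `ollama://` model the name rendered into `{{ .ModelName }}` is the ollama tag from the URI rather than the Kubernetes object name, and no model-loader initContainer is injected, since the ollama container pulls the model itself via `ollama run`. A minimal sketch of the naming behavior, with uriProviderSketch as an illustrative stand-in for the real URIProvider:

package main

import "fmt"

// uriProviderSketch mirrors the ModelName behavior added in this commit:
// OLLAMA-protocol models report the URI address (the ollama tag) as their name.
type uriProviderSketch struct {
    modelName    string // Kubernetes object name, e.g. "qwen2-0--5b"
    protocol     string // "OSS", "OLLAMA", ...
    modelAddress string // address part of the URI, e.g. "qwen2:0.5b"
}

func (p *uriProviderSketch) ModelName() string {
    if p.protocol == "OLLAMA" {
        return p.modelAddress
    }
    return p.modelName
}

func main() {
    p := &uriProviderSketch{modelName: "qwen2-0--5b", protocol: "OLLAMA", modelAddress: "qwen2:0.5b"}
    fmt.Println(p.ModelName()) // qwen2:0.5b, rendered into the flag as `ollama run qwen2:0.5b`
}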
