llmaz (pronounced /lima:z/) aims to provide a production-ready inference platform for large language models on Kubernetes. It closely integrates with state-of-the-art inference backends to bring leading-edge research to the cloud.
🌱 llmaz is alpha now, so the API may change before graduating to Beta.
Read the Installation for guidance.
Here's a toy example for deploying facebook/opt-125m; all you need to do is apply a Model and a Playground.
If you're running on CPUs, you can refer to llama.cpp, or find more examples here.
Note: if your model needs a Hugging Face token for weight downloads, please run

```shell
kubectl create secret generic modelhub-secret --from-literal=HF_TOKEN=<your token>
```

beforehand.
```yaml
apiVersion: llmaz.io/v1alpha1
kind: OpenModel
metadata:
  name: opt-125m
spec:
  familyName: opt
  source:
    modelHub:
      modelID: facebook/opt-125m
  inferenceConfig:
    flavors:
      - name: default # Configure GPU type
        requests:
          nvidia.com/gpu: 1
```
```yaml
apiVersion: inference.llmaz.io/v1alpha1
kind: Playground
metadata:
  name: opt-125m
spec:
  replicas: 1
  modelClaim:
    modelName: opt-125m
```
By default, llmaz will create a ClusterIP Service named like `<service>-lb` for load balancing.
```shell
kubectl port-forward svc/opt-125m-lb 8080:8080
```
```shell
curl http://localhost:8080/v1/models
```

```shell
curl http://localhost:8080/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "opt-125m",
    "prompt": "San Francisco is a",
    "max_tokens": 10,
    "temperature": 0
  }'
```
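Since the Playground serves an OpenAI-compatible API, the same completion request can also be issued from Python with just the standard library. This is a minimal sketch: the `build_completion_payload` and `complete` helpers are illustrative names, and the `complete` call assumes the `kubectl port-forward` command above is still running.

```python
import json
import urllib.request


def build_completion_payload(model, prompt, max_tokens=10, temperature=0):
    """Build the same JSON request body as the curl example above."""
    return json.dumps({
        "model": model,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
    })


def complete(prompt, model="opt-125m", host="http://localhost:8080"):
    """POST a completion request to the port-forwarded service.

    Assumes `kubectl port-forward svc/opt-125m-lb 8080:8080` is active.
    """
    req = urllib.request.Request(
        f"{host}/v1/completions",
        data=build_completion_payload(model, prompt).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())


# Build the payload without sending it, mirroring the curl example.
payload = build_completion_payload("opt-125m", "San Francisco is a")
```

Calling `complete("San Francisco is a")` then returns the parsed completion response from the service.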
If you want to learn more about this project, please refer to develop.md.
Join us for more discussions:
All kinds of contributions are welcome! Please follow CONTRIBUTING.md.
We also have an official fundraising venue through OpenCollective. We'll use the fund transparently to support the development, maintenance, and adoption of our project.