llmaz (pronounced /lima:z/) aims to provide a production-ready inference platform for large language models on Kubernetes. It closely integrates with state-of-the-art inference backends like vLLM to bring cutting-edge research to the cloud.
Read the Installation guide for setup instructions.
Once Models (e.g. facebook/opt-125m) are published, you can quickly deploy a Playground to serve the model.
```yaml
apiVersion: llmaz.io/v1alpha1
kind: Model
metadata:
  name: opt-125m
spec:
  familyName: opt
  dataSource:
    modelID: facebook/opt-125m
  inferenceFlavors:
  - name: t4 # GPU type
    requests:
      nvidia.com/gpu: 1
```
```yaml
apiVersion: inference.llmaz.io/v1alpha1
kind: Playground
metadata:
  name: opt-125m
spec:
  replicas: 1
  modelClaim:
    modelName: opt-125m
```
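Assuming the two manifests above are saved locally (the file names below are illustrative, not prescribed by llmaz), they can be applied and checked with standard kubectl commands:

```shell
# Apply the Model and Playground manifests (file names are assumptions).
kubectl apply -f model.yaml
kubectl apply -f playground.yaml

# Wait for the serving pod to become ready before port-forwarding to it.
kubectl wait --for=condition=Ready pod/opt-125m-0 --timeout=300s
```

The `kubectl wait` step avoids racing the port-forward against a pod that is still pulling the model.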
```bash
kubectl port-forward pod/opt-125m-0 8080:8080
```

```bash
curl http://localhost:8080/v1/models
```
```bash
curl http://localhost:8080/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "facebook/opt-125m",
    "prompt": "San Francisco is a",
    "max_tokens": 10,
    "temperature": 0
  }'
```
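The same request can be issued from any OpenAI-compatible client. As a minimal, offline sketch, the snippet below builds the request body shown above and extracts the generated text from a response following the standard OpenAI-compatible completions shape (which vLLM exposes); the helper names are illustrative, not part of llmaz:

```python
import json

def build_completion_request(model, prompt, max_tokens=10, temperature=0):
    """Serialize a completions payload matching the curl example above."""
    return json.dumps({
        "model": model,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
    })

def extract_completion_text(response):
    """Join the generated text from an OpenAI-style completions response,
    i.e. a dict of the form {"choices": [{"text": ...}, ...]}."""
    return "".join(choice["text"] for choice in response.get("choices", []))

if __name__ == "__main__":
    payload = build_completion_request("facebook/opt-125m", "San Francisco is a")
    print(payload)
```

POST the payload to `http://localhost:8080/v1/completions` (while the port-forward is running) and feed the decoded JSON reply to `extract_completion_text`.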
Refer to examples to learn more.
All kinds of contributions are welcome! Please follow Contributing.

Thanks to all these contributors.