llmaz (pronounced /lima:z/) aims to provide a production-ready inference platform for large language models on Kubernetes. It closely integrates with state-of-the-art inference backends to bring leading-edge research to the cloud.
🌱 llmaz is alpha now, so the API may change before graduating to Beta.
Read the Installation for guidance.
Here's a toy example for deploying facebook/opt-125m; all you need to do is apply a Model and a Playground.
If you're running on CPUs, you can refer to llama.cpp, or find more examples here.
Note: if your model needs a Hugging Face token for weight downloads, please run

```shell
kubectl create secret generic modelhub-secret --from-literal=HF_TOKEN=<your token>
```

beforehand.
```yaml
apiVersion: llmaz.io/v1alpha1
kind: OpenModel
metadata:
  name: opt-125m
spec:
  familyName: opt
  source:
    modelHub:
      modelID: facebook/opt-125m
  inferenceConfig:
    flavors:
      - name: default # Configure GPU type
        requests:
          nvidia.com/gpu: 1
```
```yaml
apiVersion: inference.llmaz.io/v1alpha1
kind: Playground
metadata:
  name: opt-125m
spec:
  replicas: 1
  modelClaim:
    modelName: opt-125m
```
By default, llmaz will create a ClusterIP Service named like `<service>-lb` for load balancing.
```shell
kubectl port-forward svc/opt-125m-lb 8080:8080
```
```shell
curl http://localhost:8080/v1/models
```

```shell
curl http://localhost:8080/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "opt-125m",
    "prompt": "San Francisco is a",
    "max_tokens": 10,
    "temperature": 0
  }'
```
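Since the Playground serves an OpenAI-compatible API, the same completion request can also be issued from Python with just the standard library. This is a minimal sketch: the `build_completion_payload` and `complete` helpers are illustrative names, and the `complete` call assumes the `kubectl port-forward` command above is still running.

```python
import json
import urllib.request


def build_completion_payload(model, prompt, max_tokens=10, temperature=0):
    """Build the same JSON request body as the curl example above."""
    return json.dumps({
        "model": model,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
    })


def complete(prompt, model="opt-125m", host="http://localhost:8080"):
    """POST a completion request to the port-forwarded service.

    Assumes `kubectl port-forward svc/opt-125m-lb 8080:8080` is active.
    """
    req = urllib.request.Request(
        f"{host}/v1/completions",
        data=build_completion_payload(model, prompt).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())


# Build the payload without sending it, mirroring the curl example.
payload = build_completion_payload("opt-125m", "San Francisco is a")
```

Calling `complete("San Francisco is a")` then returns the parsed completion response from the service.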
If you want to learn more about this project, please refer to develop.md.
Join us for more discussions:
All kinds of contributions are welcome! Please follow CONTRIBUTING.md.
We also have an official fundraising venue through OpenCollective. We'll use the fund transparently to support the development, maintenance, and adoption of our project.