This article demonstrates how to enable optimizations for large language models (LLMs) on Mosaic AI Model Serving.
Optimized LLM serving improves throughput and latency by roughly 3 to 5 times compared to traditional serving approaches. The following table summarizes the supported LLM families and their variants.
Databricks recommends installing foundation models using Databricks Marketplace. You can search for a model family and, from the model page, select Get access and provide login credentials to install the model to Unity Catalog.
Requirements
Optimized LLM serving is supported as part of the Public Preview of GPU deployments.
Your model must be logged using MLflow 2.4 and above or Databricks Runtime 13.2 ML and above.
Databricks recommends using models in Unity Catalog for faster upload and download of large models.
When deploying models, it's essential to match your model's parameter size with the appropriate compute size. See the table below for recommendations. For models with 50 billion or more parameters, please reach out to your Databricks account team to access the necessary GPUs.
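If your cluster has an older MLflow version, you can upgrade it from a notebook before logging the model. The following is a minimal sketch; the exact version pin is an assumption based on the requirement above, and the Python process should be restarted after installing.
Python
%pip install --upgrade "mlflow>=2.4.0"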
First, log your model with the MLflow transformers flavor and specify the task field in the MLflow metadata with metadata = {"task": "llm/v1/completions"}. This specifies the API signature used for the model serving endpoint.
Optimized LLM serving is compatible with the route types supported by Databricks AI Gateway; currently, llm/v1/completions. If there is a model family or task type you want to serve that is not supported, reach out to your Databricks account team.
Python
import mlflow
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "mosaicml/mpt-7b-instruct", torch_dtype=torch.bfloat16, trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained("mosaicml/mpt-7b-instruct")

with mlflow.start_run():
    components = {
        "model": model,
        "tokenizer": tokenizer,
    }
    mlflow.transformers.log_model(
        artifact_path="model",
        transformers_model=components,
        input_example=["Below is an instruction that describes a task. Write a response that appropriately completes the request.\n\n### Instruction:\nWhat is Apache Spark?\n\n### Response:\n"],
        metadata={"task": "llm/v1/completions"},
        registered_model_name="mpt",
    )
After your model is logged, you can register it in Unity Catalog with the following, replacing CATALOG.SCHEMA.MODEL_NAME with the three-level name of the model.
Python
mlflow.set_registry_uri("databricks-uc")

# Replace with the three-level Unity Catalog name of the model.
registered_model_name = "CATALOG.SCHEMA.MODEL_NAME"
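If the model was already logged without registering it to Unity Catalog, you can register the logged artifact explicitly. The following is a minimal sketch; run_id is assumed to be the ID of the MLflow run that logged the model, and "model" matches the artifact_path used above.
Python
# Register a previously logged model under its Unity Catalog name.
mlflow.register_model(
    model_uri=f"runs:/{run_id}/model",
    name=registered_model_name,
)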
Create your model serving endpoint
Next, create your model serving endpoint. If your model is supported by Optimized LLM serving, Databricks automatically creates an optimized model serving endpoint when you try to serve it.
Python
import requests
import json

endpoint_name = "llama2-13b-chat"
model_name = "ml.llm-catalog.llama-13b"
model_version = 3
workload_type = "GPU_MEDIUM"
workload_size = "Small"
scale_to_zero = False

API_ROOT = dbutils.notebook.entry_point.getDbutils().notebook().getContext().apiUrl().get()
API_TOKEN = dbutils.notebook.entry_point.getDbutils().notebook().getContext().apiToken().get()

data = {
    "name": endpoint_name,
    "config": {
        "served_models": [
            {
                "model_name": model_name,
                "model_version": model_version,
                "workload_size": workload_size,
                "scale_to_zero_enabled": scale_to_zero,
                "workload_type": workload_type,
            }
        ]
    },
}

headers = {"Content-Type": "application/json", "Authorization": f"Bearer {API_TOKEN}"}

response = requests.post(
    url=f"{API_ROOT}/api/2.0/serving-endpoints", json=data, headers=headers
)

print(json.dumps(response.json(), indent=4))
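Endpoint creation is asynchronous, so the API call returns before the endpoint is usable. The following is a minimal sketch for polling readiness; it assumes the same API_ROOT, API_TOKEN, headers, and endpoint_name variables defined above and uses the serving-endpoints GET API.
Python
import time

# Poll the endpoint until it reports a READY state.
while True:
    endpoint = requests.get(
        url=f"{API_ROOT}/api/2.0/serving-endpoints/{endpoint_name}",
        headers=headers,
    ).json()
    state = endpoint.get("state", {}).get("ready", "")
    print(f"Endpoint state: {state}")
    if state == "READY":
        break
    time.sleep(60)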
Input and output schema format
An optimized LLM serving endpoint has input and output schemas that Databricks controls. Four different input formats are supported.
dataframe_split is a JSON-serialized Pandas DataFrame in the split orientation.
JSON
{
"dataframe_split": {
"columns": ["prompt"],
"index": [0],
"data": [
[
"Below is an instruction that describes a task. Write a response that appropriately completes the request.\n\n### Instructions:\nWhat is Apache Spark?\n\n### Response:\n"
]
]
},
"params": {
"temperature": 0.5,
"max_tokens": 100,
"stop": ["word1", "word2"],
"candidate_count": 1
}
}
dataframe_records is a JSON-serialized Pandas DataFrame in the records orientation.
JSON
{
"dataframe_records": [
{
"prompt": "Below is an instruction that describes a task. Write a response that appropriately completes the request.\n\n### Instructions:\nWhat is Apache Spark?\n\n### Response:\n"
}
],
"params": {
"temperature": 0.5,
"max_tokens": 100,
"stop": ["word1", "word2"],
"candidate_count": 1
}
}
instances is a tensor-based format that accepts tensors in row format.
JSON
{
"instances": [
{
"prompt": "Below is an instruction that describes a task. Write a response that appropriately completes the request.\n\n### Instructions:\nWhat is Apache Spark?\n\n### Response:\n"
}
],
"params": {
"temperature": 0.5,
"max_tokens": 100,
"stop": ["word1", "word2"],
"candidate_count": 1
}
}
inputs sends queries with tensors in columnar format, where each tensor is provided as the value of its named key.
JSON
{
"inputs": {
"prompt": "Below is an instruction that describes a task. Write a response that appropriately completes the request.\n\n### Instructions:\nWhat is Apache Spark?\n\n### Response:\n"
},
"params": {
"temperature": 0.5,
"max_tokens": 100,
"stop": ["word1", "word2"],
"candidate_count": 1
}
}
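If you are starting from a Pandas DataFrame, you can build the dataframe_split payload directly with to_dict. The following is a minimal sketch; the prompt column name matches the examples above.
Python
import pandas as pd

# Build a dataframe_split request body from a Pandas DataFrame.
prompts = pd.DataFrame({"prompt": ["What is Apache Spark?"]})
data = {
    "dataframe_split": prompts.to_dict(orient="split"),
    "params": {"temperature": 0.5, "max_tokens": 100},
}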
Depending on the model size and complexity, it can take 30 minutes or more for the endpoint to become ready. After your endpoint is ready, you can query it by making an API request.
Python
data = {
    "inputs": {
        "prompt": [
            "Hello, I'm a language model,"
        ]
    },
    "params": {
        "max_tokens": 100,
        "temperature": 0.0
    }
}

headers = {"Content-Type": "application/json", "Authorization": f"Bearer {API_TOKEN}"}

response = requests.post(
    url=f"{API_ROOT}/serving-endpoints/{endpoint_name}/invocations", json=data, headers=headers
)

print(json.dumps(response.json()))
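You can also query the endpoint through the MLflow Deployments client instead of raw REST calls. The following is a minimal sketch; it assumes MLflow 2.9 or above is installed and reuses the endpoint_name variable from above.
Python
import mlflow.deployments

# Query the serving endpoint with the MLflow Deployments client.
client = mlflow.deployments.get_deploy_client("databricks")
response = client.predict(
    endpoint=endpoint_name,
    inputs={
        "inputs": {"prompt": ["Hello, I'm a language model,"]},
        "params": {"max_tokens": 100, "temperature": 0.0},
    },
)
print(response)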
Limitations