Use the Azure AI Evaluation SDK to assess the performance of your generative AI applications. The outputs of a generative AI application are measured quantitatively with mathematical metrics and with AI-assisted quality and safety metrics. Metrics are defined as evaluators. Built-in or custom evaluators can provide comprehensive insights into the application's capabilities and limitations.
Use Azure AI Evaluation SDK to:
The Azure AI SDK provides the following to evaluate generative AI applications:
evaluate API.
Source code | Package (PyPI) | API reference documentation | Product documentation | Samples
Getting started
Prerequisites
Install the Azure AI Evaluation SDK for Python with pip:
pip install azure-ai-evaluation
Key concepts
Evaluators
Evaluators are custom or prebuilt classes or functions that are designed to measure the quality of the outputs from language models or generative AI applications.
Built-in evaluators
Built-in evaluators are out-of-the-box evaluators provided by Microsoft:

| Category | Evaluator class |
|----------|-----------------|
| Performance and quality (AI-assisted) | GroundednessEvaluator, RelevanceEvaluator, CoherenceEvaluator, FluencyEvaluator, SimilarityEvaluator, RetrievalEvaluator |
| Performance and quality (NLP) | F1ScoreEvaluator, RougeScoreEvaluator, GleuScoreEvaluator, BleuScoreEvaluator, MeteorScoreEvaluator |
| Risk and safety (AI-assisted) | ViolenceEvaluator, SexualEvaluator, SelfHarmEvaluator, HateUnfairnessEvaluator, IndirectAttackEvaluator, ProtectedMaterialEvaluator |
| Composite | QAEvaluator, ContentSafetyEvaluator |

For more in-depth information on each evaluator definition and how it's calculated, see Evaluation and monitoring metrics for generative AI.
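As a rough illustration of what a mathematical (NLP) metric measures, the sketch below computes a token-overlap F1 score between a response and a ground truth. It is a simplified stand-in written for this document, not the SDK's F1ScoreEvaluator implementation; the function name token_f1 and the normalization (lowercasing, stripping punctuation, splitting on whitespace) are assumptions.

```python
from collections import Counter
import string


def _tokens(text: str) -> list[str]:
    # Lowercase, strip punctuation, split on whitespace (an assumed normalization).
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    return cleaned.split()


def token_f1(response: str, ground_truth: str) -> float:
    """Harmonic mean of token precision and recall between two strings."""
    response_tokens = _tokens(response)
    truth_tokens = _tokens(ground_truth)
    # Multiset intersection: a shared token counts only as often as it appears in both.
    overlap = sum((Counter(response_tokens) & Counter(truth_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(response_tokens)
    recall = overlap / len(truth_tokens)
    return 2 * precision * recall / (precision + recall)


score = token_f1(
    response="Tokyo is the capital of Japan.",
    ground_truth="The capital of Japan is Tokyo.",
)
```

Because the two example sentences contain the same words in a different order, this bag-of-words score is 1.0. Order-sensitive metrics such as BLEU compare n-grams and would generally score the pair lower.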
import os
from azure.ai.evaluation import evaluate, RelevanceEvaluator, ViolenceEvaluator, BleuScoreEvaluator

# NLP BLEU score evaluator
bleu_score_evaluator = BleuScoreEvaluator()
result = bleu_score_evaluator(
    response="Tokyo is the capital of Japan.",
    ground_truth="The capital of Japan is Tokyo."
)

# AI-assisted quality evaluator
model_config = {
    "azure_endpoint": os.environ.get("AZURE_OPENAI_ENDPOINT"),
    "api_key": os.environ.get("AZURE_OPENAI_API_KEY"),
    "azure_deployment": os.environ.get("AZURE_OPENAI_DEPLOYMENT"),
}

relevance_evaluator = RelevanceEvaluator(model_config)
result = relevance_evaluator(
    query="What is the capital of Japan?",
    response="The capital of Japan is Tokyo."
)

# There are two ways to provide an Azure AI project.
# Option #1: Using Azure AI project details
azure_ai_project = {
    "subscription_id": "<subscription_id>",
    "resource_group_name": "<resource_group_name>",
    "project_name": "<project_name>",
}

violence_evaluator = ViolenceEvaluator(azure_ai_project)
result = violence_evaluator(
    query="What is the capital of France?",
    response="Paris."
)

# Option #2: Using the Azure AI project URL
azure_ai_project = "https://{resource_name}.services.ai.azure.com/api/projects/{project_name}"

violence_evaluator = ViolenceEvaluator(azure_ai_project)
result = violence_evaluator(
    query="What is the capital of France?",
    response="Paris."
)
Custom evaluators
Built-in evaluators are a convenient way to start evaluating your application's generations out of the box. However, you can also build your own code-based or prompt-based evaluators to cater to your specific evaluation needs.
# Custom evaluator as a function to calculate response length
def response_length(response, **kwargs):
    return len(response)

# Custom class-based evaluator to check for blocked words
class BlocklistEvaluator:
    def __init__(self, blocklist):
        self._blocklist = blocklist

    def __call__(self, *, response: str, **kwargs):
        # True if any blocked word appears anywhere in the response
        score = any(word in response for word in self._blocklist)
        return {"score": score}

blocklist_evaluator = BlocklistEvaluator(blocklist=["bad", "worst", "terrible"])

result = response_length("The capital of Japan is Tokyo.")
result = blocklist_evaluator(response="The capital of Japan is Tokyo.")
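One design consideration for the blocklist evaluator above: a plain substring check also flags words that merely contain a blocked term (for example, "badge" contains "bad"). The sketch below shows one possible whole-word variant. The class name WordBlocklistEvaluator and the extra matched_words key are hypothetical and not part of the SDK; the class is shown standalone rather than wired into any SDK call.

```python
import re


class WordBlocklistEvaluator:
    """Hypothetical variant of the blocklist evaluator that matches whole words only."""

    def __init__(self, blocklist):
        # Store lowercase terms so the comparison is case-insensitive.
        self._blocklist = {word.lower() for word in blocklist}

    def __call__(self, *, response: str, **kwargs):
        # Split the response into lowercase word tokens before comparing.
        words = set(re.findall(r"[a-z0-9']+", response.lower()))
        matches = sorted(words & self._blocklist)
        return {"score": bool(matches), "matched_words": matches}


word_blocklist_evaluator = WordBlocklistEvaluator(blocklist=["bad", "worst", "terrible"])
clean = word_blocklist_evaluator(response="She earned a badge for her work.")
flagged = word_blocklist_evaluator(response="That was the worst answer.")
```

Here clean reports no match even though "badge" contains "bad", while flagged reports the blocked word "worst".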
Evaluate API
The package provides an evaluate API, which can be used to run multiple evaluators together to evaluate a generative AI application's responses.
from azure.ai.evaluation import evaluate

result = evaluate(
    data="data.jsonl",  # provide your data here
    evaluators={
        "blocklist": blocklist_evaluator,
        "relevance": relevance_evaluator
    },
    # column mapping
    evaluator_config={
        "relevance": {
            "column_mapping": {
                "query": "${data.queries}",
                "ground_truth": "${data.ground_truth}",
                "response": "${outputs.response}"
            }
        }
    },
    # Optionally provide your AI Foundry project information to track your evaluation results in your Azure AI Foundry project
    azure_ai_project=azure_ai_project,
    # Optionally provide an output path to dump a JSON of the metric summary, row-level data, metrics, and the AI Foundry URL
    output_path="./evaluation_results.json"
)
For more details, refer to Evaluate on test dataset using evaluate().
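The column mapping uses placeholders of the form ${data.<column>} and ${outputs.<key>} to bind an evaluator argument to a value for each row. To make the idea concrete, here is a minimal sketch of how such a mapping could be resolved against one JSONL row. The helper resolve_mapping is hypothetical and does not reflect the SDK's internal logic, and the sample rows are made up for illustration.

```python
import json
import re

# Matches placeholders such as ${data.queries} or ${outputs.response}.
_PLACEHOLDER = re.compile(r"^\$\{(data|outputs)\.(\w+)\}$")


def resolve_mapping(column_mapping, data_row, outputs_row=None):
    """Build evaluator keyword arguments from one dataset row (illustrative only)."""
    sources = {"data": data_row, "outputs": outputs_row or {}}
    kwargs = {}
    for argument, placeholder in column_mapping.items():
        match = _PLACEHOLDER.match(placeholder)
        if match is None:
            raise ValueError(f"Unsupported mapping value: {placeholder!r}")
        source, column = match.groups()
        kwargs[argument] = sources[source][column]
    return kwargs


# Two made-up JSONL lines with column names like those in the mapping above.
jsonl_text = "\n".join([
    json.dumps({"queries": "What is the capital of Japan?",
                "ground_truth": "Tokyo",
                "response": "The capital of Japan is Tokyo."}),
    json.dumps({"queries": "What is the capital of France?",
                "ground_truth": "Paris",
                "response": "Paris."}),
])
rows = [json.loads(line) for line in jsonl_text.splitlines()]

mapping = {"query": "${data.queries}", "response": "${data.response}"}
evaluator_inputs = [resolve_mapping(mapping, row) for row in rows]
```

Each entry in evaluator_inputs is a dictionary of keyword arguments named after the evaluator's parameters (query, response), regardless of what the dataset columns are called.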
Evaluate generative AI application
from askwiki import askwiki

result = evaluate(
    data="data.jsonl",
    target=askwiki,
    evaluators={
        "relevance": relevance_evaluator
    },
    evaluator_config={
        "default": {
            "column_mapping": {
                "query": "${data.queries}",
                "context": "${outputs.context}",
                "response": "${outputs.response}"
            }
        }
    }
)
The code snippet above refers to the askwiki application in this sample.
For more details, refer to Evaluate on a target.
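The ${outputs.context} and ${outputs.response} placeholders in the mapping suggest that the target returns a dictionary with context and response keys. The sketch below is a stand-in target with that shape, useful for trying out the flow without a real application. The function ask_stub and its in-memory facts are hypothetical; the real askwiki application lives in the linked sample, and how the SDK binds dataset columns to the target's parameters should be checked against the SDK documentation.

```python
# A tiny in-memory "knowledge base" standing in for a real retrieval step.
_FACTS = {
    "japan": "Tokyo is the capital of Japan.",
    "france": "Paris is the capital of France.",
}


def ask_stub(query: str) -> dict:
    """Hypothetical target: returns a response plus the context it was drawn from."""
    context = next(
        (fact for key, fact in _FACTS.items() if key in query.lower()),
        "",
    )
    response = context if context else "I don't know."
    return {"response": response, "context": context}


output = ask_stub("What is the capital of Japan?")
```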
Simulator
Simulators allow users to generate synthetic data using their application. The simulator expects the user to provide a callback method that invokes their AI application; the integration between your AI application and the simulator happens in this callback. Here's what a sample callback looks like:
from typing import Any, Dict, List, Optional

async def callback(
    messages: Dict[str, List[Dict]],
    stream: bool = False,
    session_state: Any = None,
    context: Optional[Dict[str, Any]] = None,
) -> dict:
    messages_list = messages["messages"]
    # Get the last message from the user
    latest_message = messages_list[-1]
    query = latest_message["content"]
    # Call your endpoint or AI application here
    # response should be a string
    response = call_to_your_application(query, messages_list, context)
    formatted_response = {
        "content": response,
        "role": "assistant",
        "context": "",
    }
    messages["messages"].append(formatted_response)
    return {"messages": messages["messages"], "stream": stream, "session_state": session_state, "context": context}
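Before handing a callback to a simulator, it can help to run it once locally and check the shape of what it returns. The sketch below does that with a condensed copy of the callback and a stub application. The echo-style call_to_your_application is a hypothetical placeholder for your real application call, and the single-message conversation is made up for the test.

```python
import asyncio
from typing import Any, Dict, List, Optional


def call_to_your_application(query: str, messages: List[Dict], context: Optional[Dict[str, Any]]) -> str:
    # Hypothetical stand-in for a real application call; it simply echoes the query.
    return f"You asked: {query}"


async def callback(
    messages: Dict[str, List[Dict]],
    stream: bool = False,
    session_state: Any = None,
    context: Optional[Dict[str, Any]] = None,
) -> dict:
    messages_list = messages["messages"]
    # The last message in the list is the latest user turn.
    query = messages_list[-1]["content"]
    response = call_to_your_application(query, messages_list, context)
    # Append the assistant reply in the same message format.
    messages_list.append({"content": response, "role": "assistant", "context": ""})
    return {"messages": messages_list, "stream": stream, "session_state": session_state, "context": context}


conversation = {"messages": [{"role": "user", "content": "What is the capital of Japan?"}]}
result = asyncio.run(callback(conversation))
last_message = result["messages"][-1]
```

After the call, the conversation holds two messages, and the last one carries the assistant role with the stub's reply.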
The simulator initialization and invocation look like this:
import asyncio
import os
from azure.ai.evaluation.simulator import Simulator

model_config = {
    "azure_endpoint": os.environ.get("AZURE_ENDPOINT"),
    "azure_deployment": os.environ.get("AZURE_DEPLOYMENT_NAME"),
    "api_version": os.environ.get("AZURE_API_VERSION"),
}
custom_simulator = Simulator(model_config=model_config)
outputs = asyncio.run(custom_simulator(
    target=callback,
    conversation_turns=[
        [
            "What should I know about the public gardens in the US?",
        ],
        [
            "How do I simulate data against LLMs",
        ],
    ],
    max_conversation_turns=2,
))
# Write each simulated conversation as query/response JSON lines
with open("simulator_output.jsonl", "w") as f:
    for output in outputs:
        f.write(output.to_eval_qr_json_lines())
Adversarial Simulator
import asyncio
from azure.ai.evaluation.simulator import AdversarialSimulator, AdversarialScenario
from azure.identity import DefaultAzureCredential

# There are two ways to provide an Azure AI project.
# Option #1: Using Azure AI project details
azure_ai_project = {
    "subscription_id": "<subscription_id>",
    "resource_group_name": "<resource_group_name>",
    "project_name": "<project_name>"
}

# Option #2: Using the Azure AI project URL
azure_ai_project = "https://{resource_name}.services.ai.azure.com/api/projects/{project_name}"

scenario = AdversarialScenario.ADVERSARIAL_QA
simulator = AdversarialSimulator(azure_ai_project=azure_ai_project, credential=DefaultAzureCredential())

outputs = asyncio.run(
    simulator(
        scenario=scenario,
        max_conversation_turns=1,
        max_simulation_results=3,
        target=callback
    )
)

print(outputs.to_eval_qr_json_lines())
For more details about the simulator, visit the following links:
Examples
In the following section you will find examples of:
More examples can be found here.
Troubleshooting
General
Please refer to troubleshooting for common issues.
Logging
This library uses the standard logging library for logging. Basic information about HTTP sessions (URLs, headers, etc.) is logged at INFO level.
Detailed DEBUG level logging, including request/response bodies and unredacted headers, can be enabled on a client with the logging_enable argument.
See full SDK logging documentation with examples here.
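As a minimal sketch of the logging setup described above, the following configures Python's standard logging module to emit DEBUG-level records to stdout. It assumes that the library logs under the "azure" logger namespace, as is conventional for Azure SDK for Python packages; confirm the exact logger name and the use of logging_enable in the SDK logging documentation.

```python
import logging
import sys

# "azure" is assumed to be the parent logger name for Azure SDK for Python libraries.
logger = logging.getLogger("azure")
logger.setLevel(logging.DEBUG)

# Send log records to stdout with a timestamped format.
handler = logging.StreamHandler(stream=sys.stdout)
handler.setFormatter(logging.Formatter("%(asctime)s %(name)s %(levelname)s %(message)s"))
logger.addHandler(handler)
```

DEBUG output can include request and response bodies and unredacted headers, so this level is best kept to local troubleshooting.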
Next steps
This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit cla.microsoft.com.
When you submit a pull request, a CLA-bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., label, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.
This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact opencode@microsoft.com with any additional questions or comments.