An open-source framework to evaluate, test and monitor ML and LLM-powered systems.
Evidently is an open-source Python library to evaluate, test, and monitor ML and LLM systems, from experiments to production.
Evidently is very modular. You can start with one-off evaluations or host a full monitoring service.
1. Reports and Test Suites
Reports compute and summarize various data, ML and LLM quality evals.
Turn any Report into a Test Suite by adding pass/fail conditions, set with operators such as gt (greater than), lt (less than), etc.
Monitoring UI service helps visualize metrics and test results over time.
You can choose between self-hosting the open-source version and using Evidently Cloud.
Evidently Cloud offers a generous free tier and extra features like dataset and user management, alerting, and no-code evals. Compare OSS vs Cloud.
To install from PyPI:
pip install evidently
To install Evidently using conda installer, run:
conda install -c conda-forge evidently
This is a simple Hello World. Check the Tutorials for more: LLM evaluation.
Import the necessary components:
import pandas as pd
from evidently import Report
from evidently import Dataset, DataDefinition
from evidently.descriptors import Sentiment, TextLength, Contains
from evidently.presets import TextEvals
Create a toy dataset with questions and answers.
eval_df = pd.DataFrame([
    ["What is the capital of Japan?", "The capital of Japan is Tokyo."],
    ["Who painted the Mona Lisa?", "Leonardo da Vinci."],
    ["Can you write an essay?", "I'm sorry, but I can't assist with homework."]],
    columns=["question", "answer"])
Create an Evidently Dataset object and add descriptors: row-level evaluators. We'll check the sentiment of each response, its length, and whether it contains words indicative of denial.
eval_dataset = Dataset.from_pandas(pd.DataFrame(eval_df),
    data_definition=DataDefinition(),
    descriptors=[
        Sentiment("answer", alias="Sentiment"),
        TextLength("answer", alias="Length"),
        Contains("answer", items=['sorry', 'apologize'], mode="any", alias="Denials")
    ])
You can view the dataframe with added scores:
eval_dataset.as_dataframe()
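To build intuition for what "row-level evaluator" means, here is a minimal plain-Python sketch that scores each answer independently. The helpers `text_length` and `contains_any` are hypothetical stand-ins written for this illustration. They are not Evidently's implementation, and the real descriptors may differ in details such as case handling.

```python
# Plain-Python illustration of row-level descriptors (NOT Evidently's code).
answers = [
    "The capital of Japan is Tokyo.",
    "Leonardo da Vinci.",
    "I'm sorry, but I can't assist with homework.",
]


def text_length(text: str) -> int:
    """Hypothetical length descriptor: number of characters in the text."""
    return len(text)


def contains_any(text: str, items: list[str]) -> bool:
    """Hypothetical 'contains' check in 'any' mode.

    Assumption: matching is case-insensitive substring search.
    """
    lowered = text.lower()
    return any(item.lower() in lowered for item in items)


# Each row gets its own scores, computed without looking at other rows.
rows = [
    {
        "answer": answer,
        "Length": text_length(answer),
        "Denials": contains_any(answer, ["sorry", "apologize"]),
    }
    for answer in answers
]

for row in rows:
    print(row)
```

Only the third answer is flagged as a denial, because it is the only one containing "sorry".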
To get a summary Report showing the distribution of scores:
report = Report([
    TextEvals()
])

my_eval = report.run(eval_dataset)
my_eval
# my_eval.json()
# my_eval.dict()
You can also choose other evaluators, including LLM-as-a-judge, and configure pass/fail conditions.
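As a rough picture of how pass/fail conditions turn summary scores into test results, consider the plain-Python sketch below. The `gt` and `lt` functions here are hypothetical helpers named after the shorthand used in this README, and `run_checks` is likewise made up for illustration. None of this is Evidently's own API, and the thresholds are arbitrary.

```python
from statistics import mean


# Hypothetical condition builders (illustrative only, not Evidently's functions).
def gt(threshold):
    """Condition that passes when the value is greater than the threshold."""
    return lambda value: value > threshold


def lt(threshold):
    """Condition that passes when the value is less than the threshold."""
    return lambda value: value < threshold


def run_checks(metrics, conditions):
    """Return 'PASS' or 'FAIL' for every metric that has a condition attached."""
    return {
        name: "PASS" if check(metrics[name]) else "FAIL"
        for name, check in conditions.items()
    }


# Example inputs: character lengths and denial flags for three toy answers.
lengths = [30, 18, 44]
denials = [False, False, True]

metrics = {
    "mean_length": mean(lengths),
    "denial_share": sum(denials) / len(denials),
}

# Arbitrary thresholds chosen for the example.
conditions = {
    "mean_length": gt(20),
    "denial_share": lt(0.2),
}

results = run_checks(metrics, conditions)
print(results)
```

Here the mean length clears its threshold, while one denial in three answers exceeds the allowed share, so that check fails.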
This is a simple Hello World. Check the Tutorials for more: Tabular data.
Import the Report, evaluation Preset and toy tabular dataset.
import pandas as pd
from sklearn import datasets

from evidently import Report
from evidently.presets import DataDriftPreset

iris_data = datasets.load_iris(as_frame=True)
iris_frame = iris_data.frame
Run the Data Drift evaluation preset, which tests for shifts in column distributions. Take the first 60 rows of the dataframe as "current" data and the remaining rows as reference. Get the output in a Jupyter notebook:
report = Report([
    DataDriftPreset(method="psi")
],
include_tests=True)
my_eval = report.run(iris_frame.iloc[:60], iris_frame.iloc[60:])
my_eval
You can also save an HTML file. You'll need to open it from the destination folder.
my_eval.save_html("file.html")
To get the output as JSON or Python dictionary:
my_eval.json()
# my_eval.dict()
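The example above selects `method="psi"`, the Population Stability Index. For readers unfamiliar with it, here is a minimal pure-Python sketch of the textbook formula, the sum over bins of (current share - reference share) * ln(current share / reference share). The equal-width binning, the bin count, and the `eps` floor for empty bins are assumptions made for this illustration. Evidently's own binning and thresholds may differ.

```python
import math


def psi(reference, current, bins=10, eps=1e-4):
    """Population Stability Index between two numeric samples.

    Assumptions for this sketch: equal-width bins spanning the reference
    range, and empty bins floored at `eps` to avoid log(0).
    """
    lo, hi = min(reference), max(reference)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 if all values are equal

    def shares(values):
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            idx = min(max(idx, 0), bins - 1)  # clamp outliers into the edge bins
            counts[idx] += 1
        return [max(c / len(values), eps) for c in counts]

    ref_shares = shares(reference)
    cur_shares = shares(current)
    return sum(
        (c - r) * math.log(c / r)
        for r, c in zip(ref_shares, cur_shares)
    )


# Synthetic data: the "current" sample is the reference shifted by +3.
reference = [i / 10 for i in range(100)]
shifted = [v + 3 for v in reference]

no_drift = psi(reference, reference)
drift = psi(reference, shifted)
print(f"identical samples: {no_drift:.3f}, shifted sample: {drift:.3f}")
```

Identical samples give a PSI of zero, and the value grows as the two distributions move apart.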
You can choose other Presets, create Reports from individual Metrics, and configure pass/fail conditions.
This launches a demo project in the locally hosted Evidently UI. Sign up for Evidently Cloud to instantly get a managed version with additional features.
Recommended step: create a virtual environment and activate it.
pip install virtualenv
virtualenv venv
source venv/bin/activate
After installing Evidently (pip install evidently), run the Evidently UI with the demo projects:
evidently ui --demo-projects all
Visit localhost:8000 to access the UI.
What can you evaluate?
Evidently has 100+ built-in evals. You can also add custom ones.
Here are examples of things you can check:
Text descriptors: Length, sentiment, toxicity, language, special symbols, regular expression matches, etc.
LLM outputs: Semantic similarity, retrieval relevance, summarization quality, etc. with model- and LLM-based evals.
Data quality: Missing values, duplicates, min-max ranges, new categorical values, correlations, etc.
Data distribution drift: 20+ statistical tests and distance metrics to compare shifts in data distribution.
Classification: Accuracy, precision, recall, ROC AUC, confusion matrix, bias, etc.
Regression: MAE, ME, RMSE, error distribution, error normality, error bias, etc.
Ranking (inc. RAG): NDCG, MAP, MRR, Hit Rate, etc.
Recommendations: Serendipity, novelty, diversity, popularity bias, etc.

We welcome contributions! Read the Guide to learn more.
For more examples, refer to the complete Documentation.
If you want to chat and connect, join our Discord community!