Evaluate models on tasks, multi-task mixes, and full benchmarks straight from the command line:

```bash
# Simple evaluation
unitxt-evaluate \
    --tasks "card=cards.mmlu_pro.engineering" \
    --model cross_provider \
    --model_args "model_name=llama-3-1-8b-instruct" \
    --limit 10

# Multi-task evaluation
unitxt-evaluate \
    --tasks "card=cards.text2sql.bird+card=cards.mmlu_pro.engineering" \
    --model cross_provider \
    --model_args "model_name=llama-3-1-8b-instruct,max_tokens=256" \
    --split test \
    --limit 10 \
    --output_path ./results/evaluate_cli \
    --log_samples \
    --apply_chat_template

# Benchmark evaluation
unitxt-evaluate \
    --tasks "benchmarks.tool_calling" \
    --model cross_provider \
    --model_args "model_name=llama-3-1-8b-instruct,max_tokens=256" \
    --split test \
    --limit 10 \
    --output_path ./results/evaluate_cli \
    --log_samples \
    --apply_chat_template
```
Load thousands of datasets in chat API format, ready for any model:
```python
from unitxt import load_dataset

dataset = load_dataset(
    card="cards.gpqa.diamond",
    split="test",
    format="formats.chat_api",
)
```

📊 Available on The Catalog
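Because each record already carries chat-format messages, the loaded split can be fed straight to an inference engine and scored. A minimal sketch, reusing the `HFAutoModelInferenceEngine` from the end-to-end example below; the `loader_limit` argument is an assumption used here only to keep the run small:

```python
from unitxt import evaluate, load_dataset
from unitxt.inference import HFAutoModelInferenceEngine

# Load the card in chat API format; loader_limit (assumed) caps the number
# of instances so the sketch runs quickly.
dataset = load_dataset(
    card="cards.gpqa.diamond",
    split="test",
    format="formats.chat_api",
    loader_limit=10,
)

# A small local chat model; any other supported engine could be swapped in.
model = HFAutoModelInferenceEngine(
    model_name="Qwen/Qwen1.5-0.5B-Chat", max_new_tokens=32
)

# Run inference on the chat-formatted instances and score the predictions.
predictions = model(dataset)
results = evaluate(predictions=predictions, data=dataset)
print(results.global_scores.summary)
```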
Launch the graphical user interface to explore datasets and benchmarks:
```bash
pip install unitxt[ui]
unitxt-explore
```
Evaluate your own data with any model:
```python
# Import required components
from unitxt import evaluate, create_dataset
from unitxt.blocks import Task, InputOutputTemplate
from unitxt.inference import HFAutoModelInferenceEngine

# Question-answer dataset
data = [
    {"question": "What is the capital of Texas?", "answer": "Austin"},
    {"question": "What is the color of the sky?", "answer": "Blue"},
]

# Define the task and evaluation metric
task = Task(
    input_fields={"question": str},
    reference_fields={"answer": str},
    prediction_type=str,
    metrics=["metrics.accuracy"],
)

# Create a template to format inputs and outputs
template = InputOutputTemplate(
    instruction="Answer the following question.",
    input_format="{question}",
    output_format="{answer}",
    postprocessors=["processors.lower_case"],
)

# Prepare the dataset
dataset = create_dataset(
    task=task,
    template=template,
    format="formats.chat_api",
    test_set=data,
    split="test",
)

# Set up the model (supports Hugging Face, WatsonX, OpenAI, etc.)
model = HFAutoModelInferenceEngine(
    model_name="Qwen/Qwen1.5-0.5B-Chat", max_new_tokens=32
)

# Generate predictions and evaluate
predictions = model(dataset)
results = evaluate(predictions=predictions, data=dataset)

# Print results
print("Global Results:\n", results.global_scores.summary)
print("Instance Results:\n", results.instance_scores.summary)
```
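The `HFAutoModelInferenceEngine` above runs the model locally. For the hosted providers mentioned in the comment (WatsonX, OpenAI, etc.), the same pipeline can point at a remote engine instead. A hedged sketch, assuming a `CrossProviderInferenceEngine` class whose name and `model`/`provider`/`max_tokens` arguments mirror the CLI's `--model cross_provider` and `--model_args` flags; check the inference module for the exact signature:

```python
# Assumption: the class name and arguments mirror the CLI's cross_provider option.
from unitxt.inference import CrossProviderInferenceEngine

remote_model = CrossProviderInferenceEngine(
    model="llama-3-1-8b-instruct",  # assumed model identifier, as in the CLI examples
    provider="watsonx",             # assumed provider key; credentials come from the environment
    max_tokens=256,
)

# The rest of the pipeline is unchanged: call the engine on the same dataset
# and evaluate the predictions it returns.
predictions = remote_model(dataset)
results = evaluate(predictions=predictions, data=dataset)
```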
Read the contributing guide for details on how to contribute to Unitxt.
If you use Unitxt in your research, please cite our paper:
```bibtex
@inproceedings{bandel-etal-2024-unitxt,
    title = "Unitxt: Flexible, Shareable and Reusable Data Preparation and Evaluation for Generative {AI}",
    author = "Bandel, Elron and Perlitz, Yotam and Venezian, Elad and Friedman, Roni and Arviv, Ofir and Orbach, Matan and Don-Yehiya, Shachar and Sheinwald, Dafna and Gera, Ariel and Choshen, Leshem and Shmueli-Scheuer, Michal and Katz, Yoav",
    editor = "Chang, Kai-Wei and Lee, Annie and Rajani, Nazneen",
    booktitle = "Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 3: System Demonstrations)",
    month = jun,
    year = "2024",
    address = "Mexico City, Mexico",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2024.naacl-demo.21",
    pages = "207--215",
}
```