Evaluation, benchmark, and scorecard, targeting performance (throughput and latency), accuracy on popular evaluation harnesses, safety, and hallucination.
pip install -r requirements.txt
pip install opea-eval
Note: install requirements.txt first, because a PyPI package cannot declare a direct dependency on a specific commit.
git clone https://github.com/opea-project/GenAIEval
cd GenAIEval
pip install -r requirements.txt
pip install -e .
For evaluating models on text-generation tasks, we follow the lm-evaluation-harness and provide both command-line and function-call usage. Over 60 standard academic benchmarks for LLMs are implemented, with hundreds of subtasks and variants, such as ARC, HellaSwag, MMLU, TruthfulQA, Winogrande, GSM8K, and so on.
# pip install --upgrade-strategy eager optimum[habana]
cd evals/evaluation/lm_evaluation_harness/examples
python main.py \
    --model gaudi-hf \
    --model_args pretrained=EleutherAI/gpt-j-6B \
    --tasks hellaswag \
    --device hpu \
    --batch_size 8
cd evals/evaluation/lm_evaluation_harness/examples
python main.py \
    --model hf \
    --model_args pretrained=EleutherAI/gpt-j-6B \
    --tasks hellaswag \
    --device cpu \
    --batch_size 8
from evals.evaluation.lm_evaluation_harness import LMEvalParser, evaluate

args = LMEvalParser(
    model="hf",
    user_model=user_model,
    tokenizer=tokenizer,
    tasks="hellaswag",
    device="cpu",
    batch_size=8,
)
results = evaluate(args)
Set up a separate server with GenAIComps.
# build cpu docker
docker build -f Dockerfile.cpu -t opea/lm-eval:latest .
# start the server
docker run -p 9006:9006 --ipc=host -e MODEL="hf" -e MODEL_ARGS="pretrained=Intel/neural-chat-7b-v3-3" -e DEVICE="cpu" opea/lm-eval:latest
Evaluate the model: set base_url, tokenizer, and --model genai-hf.
cd evals/evaluation/lm_evaluation_harness/examples
python main.py \
--model genai-hf \
--model_args "base_url=http://{your_ip}:9006,tokenizer=Intel/neural-chat-7b-v3-3" \
--tasks "lambada_openai" \
--batch_size 2
For evaluating models on coding tasks, or coding LLMs specifically, we follow the bigcode-evaluation-harness and provide both command-line and function-call usage. HumanEval, HumanEval+, InstructHumanEval, APPS, MBPP, MBPP+, and DS-1000 are available, for both completion (left-to-right) and insertion (FIM) modes.
cd evals/evaluation/bigcode_evaluation_harness/examples
python main.py \
    --model "codeparrot/codeparrot-small" \
    --tasks "humaneval" \
    --n_samples 100 \
    --batch_size 10 \
    --allow_code_execution
from evals.evaluation.bigcode_evaluation_harness import BigcodeEvalParser, evaluate

args = BigcodeEvalParser(
    user_model=user_model,
    tokenizer=tokenizer,
    tasks="humaneval",
    n_samples=100,
    batch_size=10,
    allow_code_execution=True,
)
results = evaluate(args)

Kubernetes platform optimization
Node resource management helps optimize AI container performance and isolation on Kubernetes nodes. See Platform optimization.
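As a minimal, hypothetical sketch of the idea (not taken from the Platform optimization guide): giving a container identical CPU and memory requests and limits places the pod in the Guaranteed QoS class, and on nodes where the kubelet CPU Manager uses the static policy an integer CPU request gets exclusive cores, improving isolation for AI containers. The image name and resource values below are placeholders.

# Hypothetical pod spec; adjust image and resource values to your workload.
kubectl apply -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: ai-inference
spec:
  containers:
  - name: model-server
    image: my-registry/model-server:latest
    resources:
      requests:
        cpu: "8"
        memory: "32Gi"
      limits:
        cpu: "8"
        memory: "32Gi"
EOF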
We provide an OPEA microservice benchmarking tool designed for microservice performance testing and benchmarking. It allows you to define test cases for various services based on YAML configurations, run load tests using stresscli (built on top of Locust), and analyze the results for performance insights.
Define Test Cases: Configure your tests in the benchmark.yaml file.
Increase File Descriptor Limit (if running large-scale tests):
This ensures the system can handle high concurrency by allowing more open files and connections.
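A common way to raise the limit for the current shell session is shown below; the value is illustrative and should be sized to your test scale.

# Raise the open-file limit for this shell before launching large-scale tests.
ulimit -n 100000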
Run the benchmark script:
cd evals/benchmark/
python benchmark.py
NOTE: benchmark.py takes the benchmark.yaml file as test case data by default. By passing a custom YAML file with the --yaml argument, benchmark.py can take custom test case data.
e.g., test data for the examples started with Docker Compose on HPU:
cd evals/benchmark/
python benchmark.py --yaml docker.hpu.benchmark.yaml
Results will be saved in the directory specified by test_output_dir in the configuration.
For more details on configuring test cases, refer to the README.
Prometheus metrics collected during the tests can be used to create Grafana dashboards for visualizing performance trends and monitoring bottlenecks. For more information, refer to the Grafana README.
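For a quick look at the raw data outside Grafana, you can query the Prometheus HTTP API directly. The endpoint below is Prometheus's standard instant-query API; the metric name is a hypothetical placeholder, so substitute whichever request-latency metric your services actually expose.

# Instant PromQL query against the Prometheus HTTP API (metric name is a placeholder).
curl -s 'http://{prometheus_ip}:9090/api/v1/query' \
  --data-urlencode 'query=histogram_quantile(0.9, rate(http_request_duration_seconds_bucket[5m]))'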