To add a new task, first open an issue to determine whether it will be integrated into the core evaluations of lighteval, the extended tasks, or the community tasks, and add its dataset to the Hub.
A popular community evaluation can move to become an extended or core evaluation over time.
You can find examples of custom tasks in the `community_tasks` directory.
Step by step creation of a custom task

To contribute your custom task to the lighteval repo, you would first need to install the required dev dependencies by running `pip install -e .[dev]` and then run `pre-commit install` to install the pre-commit hooks.
First, create a Python file under the `community_tasks` directory.
You need to define a prompt function that will convert a line from your dataset to a document to be used for evaluation.
```python
from lighteval.tasks.requests import Doc


def prompt_fn(line, task_name: str = None):
    """Defines how to go from a dataset line to a doc object.
    Follow examples in src/lighteval/tasks/default_prompts.py, or get more info
    about what this function should do in the README.
    """
    return Doc(
        task_name=task_name,
        query=line["question"],
        choices=[f" {c}" for c in line["choices"]],
        gold_index=line["gold"],
        instruction="",
    )
```
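As an illustration, here is how this function would behave on a minimal dataset line; the sample line below is hypothetical and only shows the fields the function expects:

```python
# Hypothetical dataset line, only meant to show the fields this prompt_fn expects
line = {"question": "What is 2 + 2?", "choices": ["3", "4", "5"], "gold": 1}

doc = prompt_fn(line, task_name="community|mytask")
# doc.query       -> "What is 2 + 2?"
# doc.choices     -> [" 3", " 4", " 5"]
# doc.gold_index  -> 1
```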
Then, you need to choose a metric: you can either use an existing one (defined in `lighteval.metrics.metrics.Metrics`) or create a custom one.
```python
import numpy as np

custom_metric = SampleLevelMetric(
    metric_name="my_custom_metric_name",
    higher_is_better=True,
    category=MetricCategory.IGNORED,
    use_case=MetricUseCase.NONE,
    sample_level_fn=lambda x: x,
    corpus_level_fn=np.mean,
)
```
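If you would rather reuse a built-in metric, you can reference it directly from the `Metrics` enum instead of defining your own. A minimal sketch; the exact set of available metrics depends on your lighteval version:

```python
from lighteval.metrics.metrics import Metrics

# Reuse an existing metric instead of the custom one above,
# e.g. log-likelihood accuracy for multiple-choice tasks
metric = [Metrics.loglikelihood_acc]
```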
Then, you need to define your task using `LightevalTaskConfig`. You can define a task with or without subsets. To define a task with no subsets:
```python
task = LightevalTaskConfig(
    name="myothertask",
    prompt_function=prompt_fn,
    suite=["community"],
    hf_repo="",
    hf_subset="default",
    hf_avail_splits=[],
    evaluation_splits=[],
    few_shots_split=None,
    few_shots_select=None,
    metric=[],
)
```
If you want to create a task with multiple subsets, add them to the `SAMPLE_SUBSETS` list and create a task for each subset.
```python
# List of all the subsets to use for this eval
SAMPLE_SUBSETS = []


class CustomSubsetTask(LightevalTaskConfig):
    def __init__(
        self,
        name,
        hf_subset,
    ):
        super().__init__(
            name=name,
            hf_subset=hf_subset,
            prompt_function=prompt_fn,
            hf_repo="",
            metric=[custom_metric],
            hf_avail_splits=[],
            evaluation_splits=[],
            few_shots_split=None,
            few_shots_select=None,
            suite=["community"],
            generation_size=-1,
            stop_sequence=None,
        )


SUBSET_TASKS = [CustomSubsetTask(name=f"mytask:{subset}", hf_subset=subset) for subset in SAMPLE_SUBSETS]
```
Here is a list of the parameters and their meaning (a filled-in example is sketched after the list):
- `name` (str), your evaluation name
- `suite` (list), the suite(s) to which your evaluation should belong. This field allows us to compare different task implementations and is used as a task selection to differentiate the versions to launch. At the moment, you'll find the keywords `["helm", "bigbench", "original", "lighteval", "community", "custom"]`; for core evals, please choose `lighteval`.
- `prompt_function` (Callable), the prompt function you defined in the step above
- `hf_repo` (str), the path to your evaluation dataset on the hub
- `hf_subset` (str), the specific subset you want to use for your evaluation (note: when the dataset has no subset, fill this field with `"default"`, not with `None` or `""`)
- `hf_avail_splits` (list), all the splits available for your dataset (train, valid or validation, test, other...)
- `evaluation_splits` (list), the splits you want to use for evaluation
- `few_shots_split` (str, can be `null`), the specific split from which you want to select samples for your few-shot examples. It should be different from the sets included in `evaluation_splits`
- `few_shots_select` (str, can be `null`), the method that you will use to select items for your few-shot examples. Can be `null`, or one of:
  - `balanced` selects examples from the `few_shots_split` with balanced labels, to avoid skewing the few shot examples (hence the model generations) toward one specific label
  - `random` selects examples at random from the `few_shots_split`
  - `random_sampling` selects new examples at random from the `few_shots_split` for every new item, but if a sampled item is equal to the current one, it is removed from the available samples
  - `random_sampling_from_train` selects new examples at random from the `few_shots_split` for every new item, but if a sampled item is equal to the current one, it is kept! Only use this if you know what you are doing.
  - `sequential` selects the first `n` examples of the `few_shots_split`
- `generation_size` (int), the maximum number of tokens allowed for a generative evaluation. If your evaluation is a log likelihood evaluation (multi-choice), this value should be -1
- `stop_sequence` (list), a list of strings acting as end of sentence tokens for your generation
- `metric` (list), the metrics you want to use for your evaluation (see next section for a detailed explanation)
- `trust_dataset` (bool), set to True if you trust the dataset.
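As an illustration, here is what a configuration using these parameters could look like; the dataset path, splits, and values below are hypothetical placeholders rather than a real dataset:

```python
# Hypothetical, fully filled-in configuration; replace the placeholder values with your own
task = LightevalTaskConfig(
    name="mytask",
    suite=["community"],
    prompt_function=prompt_fn,
    hf_repo="my-org/my-eval-dataset",  # placeholder dataset path on the hub
    hf_subset="default",
    hf_avail_splits=["train", "test"],
    evaluation_splits=["test"],
    few_shots_split="train",
    few_shots_select="random_sampling",
    generation_size=-1,  # log likelihood (multi-choice) evaluation
    stop_sequence=["\n"],
    metric=[custom_metric],
    trust_dataset=True,
)
```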
Then you need to add your task to the `TASKS_TABLE` list.
```python
TASKS_TABLE = SUBSET_TASKS
```
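If you defined a single task without subsets instead, the table would simply contain that task; a sketch consistent with the `task` example above:

```python
# For a task without subsets, the table just wraps the single config defined above
TASKS_TABLE = [task]
```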
Once your file is created, you can run the evaluation with the following command:
```bash
lighteval accelerate \
    "model_name=HuggingFaceH4/zephyr-7b-beta" \
    "community|{custom_task}|{fewshots}|{truncate_few_shot}" \
    --custom-tasks {path_to_your_custom_task_file}
```