Code for the paper "MLE-Bench: Evaluating Machine Learning Agents on Machine Learning Engineering". We have released the code used to construct the dataset, the evaluation logic, as well as the agents we evaluated for this benchmark.
| Agent | Model | Low == Lite (%) | Medium (%) | High (%) | All (%) | Date | Grading Reports Available |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Neo | multi-agent | 48.48 ± 1.52 | 29.82 ± 2.32 | 24.44 ± 2.22 | 34.22 ± 0.89 | 2025-07-28 | ✓ |
| ML-Master | deepseek-r1 | 48.5 ± 1.5 | 20.2 ± 2.3 | 24.4 ± 2.2 | 29.3 ± 0.8 | 2025-06-17 | ✓ |
| RD-Agent | o1-preview | 48.18 ± 2.49 | 8.95 ± 2.36 | 18.67 ± 2.98 | 22.4 ± 1.1 | 2025-05-14 | ✓ |
| AIDE | o1-preview | 34.3 ± 2.4 | 8.8 ± 1.1 | 10.0 ± 1.9 | 16.9 ± 1.1 | 2024-10-08 | ✓ |
| AIDE | gpt-4o-2024-08-06 | 19.0 ± 1.3 | 3.2 ± 0.5 | 5.6 ± 1.0 | 8.6 ± 0.5 | 2024-10-08 | ✓ |
| AIDE | claude-3-5-sonnet-20240620 | 19.4 ± 4.9 | 2.6 ± 1.5 | 2.3 ± 2.3 | 7.5 ± 1.8 | 2024-10-08 | ✓ |
| OpenHands | gpt-4o-2024-08-06 | 11.5 ± 3.4 | 2.2 ± 1.3 | 1.9 ± 1.9 | 5.1 ± 1.3 | 2024-10-08 | ✓ |
| AIDE | llama-3.1-405b-instruct | 8.3 ± 2.6 | 1.2 ± 0.8 | 0.0 ± 0.0 | 3.1 ± 0.9 | 2024-10-08 | ✓ |
| MLAB | gpt-4o-2024-08-06 | 4.2 ± 1.5 | 0.0 ± 0.0 | 0.0 ± 0.0 | 1.3 ± 0.5 | 2024-10-08 | ✓ |

This section describes a canonical setup for comparing scores on MLE-bench. We recommend the following:
Evaluating agents with the above settings on the full 75 competitions of MLE-bench can be expensive. For users preferring a "lite" version of the benchmark, we recommend using the Low complexity split of our dataset, which consists of only 22 competitions. This reduces the number of runs substantially, while still allowing fair comparison along one column of the table above.
Furthermore, the Low complexity competitions tend to be significantly more lightweight (158 GB total dataset size, compared to 3.3 TB for the full set), so users may additionally consider reducing the runtime or compute resources available to agents to cut costs further. Note, however, that doing so risks degrading your agent's performance; see Sections 3.3 and 3.4 of our paper, where we experiment with varying resources on the full competition set.
The Lite dataset contains the following competitions:
| Competition ID | Category | Dataset Size (GB) |
| --- | --- | --- |
| aerial-cactus-identification | Image Classification | 0.0254 |
| aptos2019-blindness-detection | Image Classification | 10.22 |
| denoising-dirty-documents | Image To Image | 0.06 |
| detecting-insults-in-social-commentary | Text Classification | 0.002 |
| dog-breed-identification | Image Classification | 0.75 |
| dogs-vs-cats-redux-kernels-edition | Image Classification | 0.85 |
| histopathologic-cancer-detection | Image Regression | 7.76 |
| jigsaw-toxic-comment-classification-challenge | Text Classification | 0.06 |
| leaf-classification | Image Classification | 0.036 |
| mlsp-2013-birds | Audio Classification | 0.5851 |
| new-york-city-taxi-fare-prediction | Tabular | 5.7 |
| nomad2018-predict-transparent-conductors | Tabular | 0.00624 |
| plant-pathology-2020-fgvc7 | Image Classification | 0.8 |
| random-acts-of-pizza | Text Classification | 0.003 |
| ranzcr-clip-catheter-line-classification | Image Classification | 13.13 |
| siim-isic-melanoma-classification | Image Classification | 116.16 |
| spooky-author-identification | Text Classification | 0.0019 |
| tabular-playground-series-dec-2021 | Tabular | 0.7 |
| tabular-playground-series-may-2022 | Tabular | 0.57 |
| text-normalization-challenge-english-language | Seq->Seq | 0.01 |
| text-normalization-challenge-russian-language | Seq->Seq | 0.01 |
| the-icml-2013-whale-challenge-right-whale-redux | Audio Classification | 0.29314 |

Some MLE-bench competition data is stored using Git-LFS. Once you have downloaded and installed LFS, run:
```bash
git lfs fetch --all
git lfs pull
```
You can install `mlebench` with pip:
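One plausible invocation, assuming you have cloned this repository and are in its root directory (an editable install is convenient for development):

```bash
# Editable install of the mlebench package from the repository root
pip install -e .
```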
If you're committing code, you can install the pre-commit hooks by running:
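For reference, the standard pre-commit workflow looks like this (assuming the `pre-commit` tool itself is already installed, e.g. via pip):

```bash
# Install the git hooks defined in the repository's pre-commit configuration
pre-commit install

# Optionally, run all hooks against the entire codebase once
pre-commit run --all-files
```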
The MLE-bench dataset is a collection of 75 Kaggle competitions which we use to evaluate the ML engineering capabilities of AI systems.
Since Kaggle does not provide the held-out test set for each competition, we provide preparation scripts that split the publicly available training set into a new training and test set.
For each competition, we also provide grading scripts that can be used to evaluate the score of a submission.
We use the Kaggle API to download the raw datasets. Ensure that you have downloaded your Kaggle credentials (`kaggle.json`) and placed them in the `~/.kaggle/` directory (the default location where the Kaggle API looks for your credentials). To download and prepare the full MLE-bench dataset, run the following; the dataset is downloaded and prepared in your system's default cache directory. Note that we've found this to take around two days when running from scratch:
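A plausible invocation is sketched below; the `--all` flag is an assumption, so confirm the exact interface with `mlebench prepare --help`:

```bash
# Downloads and prepares all 75 competitions into the default cache directory
mlebench prepare --all
```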
To prepare the lite dataset, run:
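A plausible invocation, assuming the CLI exposes a flag for the Low/Lite split (again, confirm with `mlebench prepare --help`):

```bash
# Prepares only the 22 Low-complexity (Lite) competitions
mlebench prepare --lite
```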
Alternatively, you can prepare the dataset for a specific competition by running:
```bash
mlebench prepare -c <competition-id>
```
Run `mlebench prepare --help` to see the list of available competitions.
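For example, to prepare the Spaceship Titanic competition used in the grading example below:

```bash
mlebench prepare -c spaceship-titanic
```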
Answers for competitions must be submitted in CSV format; the required format is described in each competition's description, or shown in the competition's sample submission file.

You can grade multiple submissions using the `mlebench grade` command. Given a JSONL file, where each line corresponds to a submission for one competition, `mlebench grade` will produce a grading report for each competition. Each line of the JSONL file must contain the following fields:

- `competition_id`: the ID of the competition in our dataset.
- `submission_path`: a `.csv` file with the predictions for the specified competition.

See more information by running `mlebench grade --help`.
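For illustration, a minimal submissions file might look like the following (the competition IDs are taken from the Lite table above; the file paths are hypothetical):

```jsonl
{"competition_id": "spooky-author-identification", "submission_path": "submissions/spooky-author-identification.csv"}
{"competition_id": "leaf-classification", "submission_path": "submissions/leaf-classification.csv"}
```

Pass this file to `mlebench grade`; see `mlebench grade --help` for how to point the command at your JSONL file and where the grading reports are written.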
You can also grade individual submissions using the `mlebench grade-sample` command. For example, to grade a submission for the Spaceship Titanic competition, run:

```bash
mlebench grade-sample <PATH_TO_SUBMISSION> spaceship-titanic
```

See more information by running `mlebench grade-sample --help`.
We provide a base Docker image, `mlebench-env`, which is the base environment for our agents. By default, the image installs some heavy dependencies; to skip these, set the `INSTALL_HEAVY_DEPENDENCIES` environment variable to `false` when building the image, by adding `--build-arg INSTALL_HEAVY_DEPENDENCIES=false` to the `docker build` command below.

Build this image by running:

```bash
docker build --platform=linux/amd64 -t mlebench-env -f environment/Dockerfile .
```
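For example, to build the lighter variant without the heavy dependencies mentioned above:

```bash
docker build --platform=linux/amd64 -t mlebench-env \
  --build-arg INSTALL_HEAVY_DEPENDENCIES=false \
  -f environment/Dockerfile .
```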
We purposefully designed our benchmark not to make any assumptions about the agent that produces submissions, so agents can be evaluated on it more easily. We evaluated three open-source agents; we discuss this procedure in agents/README.md.
We include additional features in this repository that may be useful for MLE-bench evaluation. These include a rule-violation detector and a plagiarism detector. We refer readers to extras/README.md for more information.
We collect example usage of this library in the `examples/` directory; see examples/README.md for more information.
We place the code specific to the experiments from our publication of the benchmark in the `experiments/` directory:

- The competition splits used in our experiments are defined in `experiments/splits/`.
- After an agent run, use the `experiments/make_submission.py` script to compile its submission for grading.
- The familiarity experiments live in `experiments/familiarity/`; see experiments/familiarity/README.md for more information.

Note: when running `pytest` locally, be sure to accept the competition rules, otherwise the tests will fail.
Chan Jun Shern, Neil Chowdhury, Oliver Jaffe, James Aung, Dane Sherburn, Evan Mays, Giulio Starace, Kevin Liu, Leon Maksin, Tejal Patwardhan, Lilian Weng, Aleksander Mądry
Please cite using the following BibTeX entry:
```bibtex
@article{chan2024mle-bench,
  title={MLE-bench: Evaluating Machine Learning Agents on Machine Learning Engineering},
  author={Jun Shern Chan and Neil Chowdhury and Oliver Jaffe and James Aung and Dane Sherburn and Evan Mays and Giulio Starace and Kevin Liu and Leon Maksin and Tejal Patwardhan and Lilian Weng and Aleksander Mądry},
  year={2024},
  eprint={2410.07095},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2410.07095}
}
```