The OpenAI gpt-oss models introduce a lot of new concepts to the open-model ecosystem, and getting them to perform as expected might take some time. This guide is meant to help developers who are building inference solutions verify their implementations, as well as developers who want to test a provider’s implementation on their own to gain confidence.
The new models behave more similarly to some of our other OpenAI models than to existing open models, for example in how they output raw chain-of-thought (CoT) alongside summarized CoT and tool calls.
For best performance, we recommend that inference providers implement our Responses API format, as the API shape was specifically designed for behaviors like outputting raw CoT along with summarized CoTs (to display to users) and tool calls, without bolting additional properties onto a format. The most important part for accurate performance is to return the raw CoT as part of the output.
For this we added a new content array to the Responses API’s reasoning items. The raw CoT should be wrapped in a reasoning_text type element, making the overall output item look like the following:
{
  "type": "reasoning",
  "id": "item_67ccd2bf17f0819081ff3bb2cf6508e60bb6a6b452d3795b",
  "status": "completed",
  "summary": [
    /* optional summary elements */
  ],
  "content": [
    {
      "type": "reasoning_text",
      "text": "The user needs to know the weather, I will call the get_weather tool."
    }
  ]
}
These items should be accepted back in subsequent turns and then inserted into the harmony-formatted prompt, as outlined in the raw CoT handling guide.
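As a rough sketch of what that round trip can look like (assuming a local Responses-compatible server at http://localhost:8000/v1, a hypothetical get_weather tool call with a made-up call id, and plain HTTP via Python’s requests library), the next turn’s input echoes back the reasoning item and the function call before appending the tool result:
import requests

BASE_URL = "http://localhost:8000/v1"  # assumed local Responses-compatible server

# Items returned by the previous turn: the reasoning item shown above plus a
# hypothetical function call the model made.
previous_items = [
    {
        "type": "reasoning",
        "id": "item_67ccd2bf17f0819081ff3bb2cf6508e60bb6a6b452d3795b",
        "summary": [],
        "content": [
            {
                "type": "reasoning_text",
                "text": "The user needs to know the weather, I will call the get_weather tool.",
            }
        ],
    },
    {
        "type": "function_call",
        "call_id": "call_123",  # hypothetical call id
        "name": "get_weather",
        "arguments": '{"city": "Berlin"}',
    },
]

# On the next turn, echo the previous items back followed by the tool result so
# the server can rebuild the harmony-formatted prompt including the raw CoT.
next_input = previous_items + [
    {
        "type": "function_call_output",
        "call_id": "call_123",
        "output": '{"temperature_c": 21}',
    }
]

response = requests.post(
    f"{BASE_URL}/responses",
    json={"model": "openai/gpt-oss-120b", "input": next_input},
    timeout=60,
)
print(response.json())
This is only an illustration of the item flow; check the Responses API docs below for the exact shapes your server should accept.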
Check out the Responses API docs for the whole specification.
Chat Completions
A lot of providers offer a Chat Completions-compatible API. While we have not augmented our published API reference on the docs to provide a way to receive raw CoT, it’s still important that providers offering the gpt-oss models via a Chat Completions-compatible API return the CoT as part of their messages, and that developers have a way to pass it back.
There is currently no generally agreed-upon specification in the community, with the property on a message generally being either reasoning or reasoning_content. To be compatible with clients like the OpenAI Agents SDK, we recommend using a reasoning field as the primary property for the raw CoT in Chat Completions.
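As a minimal sketch (assuming a local Chat Completions-compatible endpoint at http://localhost:8000/v1 that follows the reasoning field recommendation above, and using Python’s requests library), a client would read the raw CoT off the assistant message and pass the message back unchanged on the next turn:
import requests

BASE_URL = "http://localhost:8000/v1"  # assumed local Chat Completions-compatible server

messages = [{"role": "user", "content": "What's the weather in Berlin?"}]

resp = requests.post(
    f"{BASE_URL}/chat/completions",
    json={"model": "openai/gpt-oss-120b", "messages": messages},
    timeout=60,
).json()

assistant_message = resp["choices"][0]["message"]
# Providers following the recommendation above return the raw CoT here.
raw_cot = assistant_message.get("reasoning")
print("Raw CoT from provider:", raw_cot)

# Pass the assistant message (including its reasoning field) back on the next
# turn so the raw CoT can be re-inserted into the harmony-formatted prompt.
messages.append(assistant_message)
messages.append({"role": "user", "content": "And in Paris?"})

followup = requests.post(
    f"{BASE_URL}/chat/completions",
    json={"model": "openai/gpt-oss-120b", "messages": messages},
    timeout=60,
).json()
print(followup["choices"][0]["message"].get("content"))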
To verify whether a provider is working, you can use the Node.js script published in our gpt-oss GitHub repository, which you can also use to run other evals. You’ll need Node.js or a similar runtime installed to run the tests.
These tests run a series of tool/function-calling requests against the Responses API or Chat Completions API you are trying to test. Afterwards they evaluate both whether the right tool was called and whether the API shapes are correct.
This largely acts as a smoke test, but it should be a good indicator of whether the APIs are compatible with our SDKs and can handle basic function calling. It does not guarantee full accuracy of the inference implementation (see the evals section below for details on that), nor does it guarantee full compatibility with the OpenAI APIs. It should still be a helpful indicator of major implementation issues.
To run the test suite, use the following commands:
# clone the repository
git clone https://github.com/openai/gpt-oss.git
# go into the compatibility test directory
cd gpt-oss/compatibility-test/
# install the dependencies
npm install
# change the provider config in providers.ts to add your provider
# run the tests
npm start -- --provider <your-provider-name>
Afterwards you should receive results for both the API implementation and the function calling performance.
If your tests are successful, the output should show 0 invalid requests and over 90% on both pass@k and pass^k. This means the implementation is likely correct. To be fully sure, you should also inspect the evals as described below.
If you want a detailed view of the individual responses, you can inspect the jsonl file that was created in your directory.
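For example, a short script like the following (assuming only that the file is standard JSON Lines; the exact field names may vary between versions of the test) can give you a quick overview of what was recorded:
import json
import sys

# Path to the results file produced by the compatibility test, passed as an argument.
path = sys.argv[1]

with open(path, "r", encoding="utf-8") as f:
    records = [json.loads(line) for line in f if line.strip()]

print(f"Loaded {len(records)} records from {path}")
# Print the top-level keys of the first few records to see what was captured.
for record in records[:5]:
    print(sorted(record.keys()))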
You can also enable debug mode to view any of the actual request payloads using DEBUG=openai-agents:openai npm start -- --provider <provider-name>, but it might get noisy. To run only one test, use the -n 1 flag for easier debugging. For testing streaming events you can use --streaming.
The team at Artificial Analysis is running AIME and GPQA evals for a variety of providers. If you are unsure about your provider, check out Artificial Analysis for the most recent metrics.
To be on the safe side, you should consider running evals yourself. The same repository as the test above contains a gpt_oss/evals folder with the test harnesses we used to verify the AIME (16 attempts per problem), GPQA (8 attempts per problem), and HealthBench (1 attempt per problem) evals for the vLLM implementation and some of our own reference implementations. You can use the same script to test your implementation.
To test a Responses API compatible API run:
python -m gpt_oss.evals --base-url http://localhost:8000/v1 --eval aime25 --sampler responses --model openai/gpt-oss-120b --reasoning-effort high
To test a Chat Completions API compatible API run:
python -m gpt_oss.evals --base-url http://localhost:8000/v1 --eval aime25 --sampler chat_completions --model openai/gpt-oss-120b --reasoning-effort high
If you are getting benchmark results similar to those we published and your function calling tests above succeeded, you likely have a correct implementation of gpt-oss.