LMQL supports various decoding algorithms, which are used to generate text from the token distribution of a language model. The decoding algorithm in use can be specified right at the beginning of a query, e.g. using a decoder keyword like `argmax`.
All supported decoding algorithms are model-agnostic and can be used with any LMQL-supported inference backend. For more information on the supported inference backends, see the Models chapter.
## Setting The Decoding Algorithm

Depending on context, LMQL offers two ways to specify the decoding algorithm to use.
**Decoder Configuration as part of the query:** The first option is to simply specify the decoding algorithm and its parameters as part of the query itself. This can be particularly useful if your choice of decoder is relevant to the concrete program you are writing.
```lmql
# use beam search with beam width 2 for
# the entire program
beam(n=2)

# uses beam search to generate RESPONSE
"This is a query with a specified decoder: [RESPONSE]"
```
Decoding algorithms are always specified for the entire query program, and cannot change within a program. To use different decoders for different parts of your program, you have to split your program into multiple queries.
**Specifying the Decoding Algorithm Externally:** The second option is to specify the decoding algorithm and its parameters externally, i.e. separately from the actual program code:
```python
import lmql

@lmql.query(model="openai/text-davinci-003", decoder="sample", temperature=1.8)
def tell_a_joke():
    '''lmql
    """A list good dad joke. A indicates the punchline:
    Q:[JOKE]
    A:[PUNCHLINE]""" where STOPS_AT(JOKE, "?") and STOPS_AT(PUNCHLINE, "\n")
    '''

tell_a_joke() # uses the decoder specified in @lmql.query(...)
tell_a_joke(decoder="beam", n=2) # uses a beam search decoder with n=2
```
This is only possible when using LMQL from a Python context.
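As noted above, using different decoders for different parts of a program requires splitting it into separate queries. The following is a rough sketch of such a split, built on the same Python integration; the model name and prompts are only illustrative, and the first query samples at a high temperature while the second continues greedily:

```python
import lmql

# sketch: one query per decoder, composed in Python
@lmql.query(model="openai/text-davinci-003", decoder="sample", temperature=1.2)
def draft_title():
    '''lmql
    "Suggest a creative title for a blog post about decoding algorithms: [TITLE]" where STOPS_AT(TITLE, "\n")
    return TITLE.strip()
    '''

@lmql.query(model="openai/text-davinci-003", decoder="argmax")
def write_intro(title):
    '''lmql
    "Write a one-sentence introduction for a blog post titled '{title}': [INTRO]" where STOPS_AT(INTRO, "\n")
    return INTRO.strip()
    '''

# each query runs with its own decoding algorithm
intro = write_intro(draft_title())
```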
## Decoding Algorithms

In general, the very first keyword of an LMQL query specifies the decoding algorithm to use. For this, the following decoder keywords are available:
### `argmax`

The `argmax` decoder is the simplest decoder available in LMQL. It greedily selects the most likely token at each step of the decoding process. It has no additional parameters. Since `argmax` decoding is deterministic, one can only generate a single sequence at a time.
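For illustration, a minimal sketch using the Python integration from above (the model name and prompt are only examples):

```python
import lmql

# greedy decoding: repeated calls produce the same single completion
@lmql.query(model="openai/text-davinci-003", decoder="argmax")
def capital_of(country):
    '''lmql
    "Q: What is the capital of {country}?\n"
    "A: [ANSWER]" where STOPS_AT(ANSWER, "\n")
    return ANSWER.strip()
    '''

print(capital_of("France"))  # deterministic, single sequence
```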
### `sample(n: int, temperature: float)`

The `sample` decoder samples `n` sequences in parallel from the model. The `temperature` parameter controls the randomness of the sampling process. Higher values of `temperature` lead to more random samples, while lower values lead to more likely samples. A temperature value of `0.0` is equivalent to the `argmax` decoder.
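A hedged sketch of sampling several sequences at once (the model name and prompt are illustrative; with `n=3`, the call is expected to yield one result per sampled sequence):

```python
import lmql

# draw three independent samples at a fairly high temperature
@lmql.query(model="openai/text-davinci-003", decoder="sample", n=3, temperature=1.2)
def story_opening():
    '''lmql
    "Write the first sentence of a mystery novel: [OPENING]" where STOPS_AT(OPENING, ".")
    return OPENING.strip()
    '''

for opening in story_opening():  # one entry per sampled sequence
    print(opening)
```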
### `beam(n: int)`

A simple beam search decoder. The `n` parameter controls the beam size. The beam search decoder is deterministic, so it will generate the same `n` sequences every time. The result of a `beam` query is a list of `n` sequences, sorted by their likelihood.
### `beam_sample(n: int, temperature: float)`

A beam search decoder that samples from the beam at each step. The `n` parameter controls the beam size, while the `temperature` parameter controls the randomness of the sampling process. The result of a `beam_sample` query is a list of `n` sequences, sorted by their likelihood.
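Since both `beam` and `beam_sample` return a likelihood-sorted list, the most likely hypothesis can be taken from index 0. A rough sketch (model name and prompt are illustrative):

```python
import lmql

# beam search with 4 hypotheses; the result is sorted by likelihood
@lmql.query(model="openai/text-davinci-003", decoder="beam", n=4)
def complete(prefix):
    '''lmql
    "{prefix}[CONTINUATION]" where STOPS_AT(CONTINUATION, ".")
    return CONTINUATION
    '''

candidates = complete("The most surprising fact about beam search is")
best = candidates[0]           # most likely continuation
alternatives = candidates[1:]  # remaining beams, in decreasing likelihood
```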
LMQL also implements a number of novel decoders. These decoders are experimental and may not work as expected. They are also not guaranteed to be stable across different LMQL versions. More documentation on these decoders will be provided in the future.
### `var(b: int, n: int)`

An experimental implementation of variable-level beam search.
### `beam_var(n: int)`

An experimental implementation of a beam search procedure that groups by currently-decoded variable and applies adjusted length penalties.
## Inspecting Decoding Trees

LMQL also provides a way to inspect the decoding trees generated by the decoders. For this, make sure to execute the query in the Playground IDE and click on the *Advanced Mode* button in the top right corner of the Playground. This will open a new pane where you can navigate and inspect the LMQL decoding tree.
This view allows you to track the decoding process, active hypotheses, and interpreter state, including the current evaluation result of the `where` clause. For an example, take a look at the translation example in the Playground (with Advanced Mode enabled).
LMQL also includes a library for array-based decoding, `dclib`, which can be used to implement custom decoders. More information on this will be provided in the future. The implementation of the available decoding procedures is located in `src/lmql/runtime/dclib/decoders.py` of the LMQL repository.
In addition to the decoding algorithm, LMQL also supports a number of additional decoding parameters, which can affect sampling behavior and token scoring:
- `max_len: int`: The maximum length of the generated sequence. If not specified, the default value of `max_len` is `2048`. Note that if the maximum length is reached, the LMQL runtime will throw an error if the query has not yet come to a valid result according to the provided `where` clause.
- `top_k: int`: Restricts the number of tokens to sample from in each step of the decoding process, based on Fan et al. (2018) (only applicable to sampling decoders).
- `top_p: float`: Top-p (nucleus) sampling, based on Holtzman et al. (2019) (only applicable to sampling decoders).
- `repetition_penalty: float`: Repetition penalty, where `1.0` means no penalty, based on Keskar et al. (2019). The more a token is already present in the generated sequence, the more its probability will be penalized.
- `frequency_penalty: float`: `frequency_penalty` as documented as part of the OpenAI API.
- `presence_penalty: float`: `presence_penalty` as documented as part of the OpenAI API.
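A sketch of how these parameters might be passed, assuming they are accepted as keyword arguments of `@lmql.query(...)` alongside the decoder configuration, just like `temperature` in the example above (model name and prompt are illustrative):

```python
import lmql

# sampling decoder with nucleus sampling, a repetition penalty and a length cap
@lmql.query(
    model="openai/text-davinci-003",
    decoder="sample",
    temperature=0.8,
    top_p=0.9,               # nucleus sampling (sampling decoders only)
    repetition_penalty=1.2,  # values > 1.0 penalize already-generated tokens
    max_len=512,             # error if no valid result within 512 tokens
)
def summarize(text):
    '''lmql
    "Summarize the following text in one sentence:\n{text}\n"
    "Summary: [SUMMARY]" where STOPS_AT(SUMMARY, "\n")
    return SUMMARY.strip()
    '''
```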
**TIP**
Note that the concrete implementation and availability of additional decoding parameters may vary across different inference backends. For reference, please see the API documentation of the respective inference interface, e.g. the HuggingFace `generate()` function or the OpenAI API.
Lastly, a number of additional runtime parameters are available, which can be used to control auxiliary aspects of the decoding process:
- `chunksize: int`: The chunk size used for `max_tokens` in OpenAI API requests or in speculative inference with local models. If not specified, the default value of `chunksize` is `32`. See also the description of this parameter in the Models chapter.
- `verbose: bool`: Enables verbose console logging for individual LLM inference calls (local generation calls or OpenAI API request payloads).
- `cache: Union[bool, str]`: `True` or `False` to enable in-memory token caching. If not specified, the default value of `cache` is `True`, indicating that in-memory caching is enabled. Setting `cache` to a string value specifies a local file to use for disk-based caching, enabling caching across multiple query executions and sessions.
- `openai_nonstop`: Experimental option for OpenAI-specific non-stop generation, which can further improve the effectiveness of caching in some scenarios.
- `chunk_timeout`: OpenAI-specific maximum time in seconds to wait for the next chunk of tokens to arrive. If exceeded, the current API request will be retried with an appropriate backoff. If not specified, the default value of `chunk_timeout` is `2.5`. Adjust this parameter if you are seeing a high number of timeouts in the console output of the LMQL runtime.
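As with the decoding parameters, the following sketch assumes these runtime options can be passed as keyword arguments of `@lmql.query(...)`; the model name, prompt, and cache file path are only examples:

```python
import lmql

# runtime configuration: larger request chunks, verbose logging and a disk cache
@lmql.query(
    model="openai/text-davinci-003",
    decoder="argmax",
    chunksize=64,               # request more tokens per API call
    verbose=True,               # log individual inference calls
    cache="lmql-cache.tokens",  # hypothetical file path for disk-based caching
)
def classify(review):
    '''lmql
    "Classify the sentiment of the following review as positive or negative:\n"
    "{review}\n"
    "Sentiment: [LABEL]" where LABEL in ["positive", "negative"]
    return LABEL
    '''
```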