Large Language Models (LLMs) are advanced machine learning models that excel in a wide range of language-related tasks such as text generation, translation, summarization, question answering, and more, without needing task-specific fine tuning for every scenario.
Modern LLMs are typically accessed through a chat model interface that takes a list of messages as input and returns a message as output.
The newest generation of chat models offer additional capabilities:
LangChain provides a consistent interface for working with chat models from different providers while offering additional features for monitoring, debugging, and optimizing the performance of applications that use LLMs.
with_structured_output
method.LangChain has many chat model integrations that allow you to use a wide variety of models from different providers.
These integrations are one of two types:
langchain-<provider>
packages.langchain-community
package.LangChain chat models are named with a convention that prefixes "Chat" to their class names (e.g., ChatOllama
, ChatAnthropic
, ChatOpenAI
, etc.).
Please review the chat model integrations for a list of supported models.
note
Models that do not include the prefix "Chat" in their name or include "LLM" as a suffix in their name typically refer to older models that do not follow the chat model interface and instead use an interface that takes a string as input and returns a string as output.
InterfaceβLangChain chat models implement the BaseChatModel interface. Because BaseChatModel
also implements the Runnable Interface, chat models support a standard streaming interface, async programming, optimized batching, and more. Please see the Runnable Interface for more details.
Many of the key methods of chat models operate on messages as input and return messages as output.
Chat models offer a standard set of parameters that can be used to configure the model. These parameters are typically used to control the behavior of the model, such as the temperature of the output, the maximum number of tokens in the response, and the maximum time to wait for a response. Please see the standard parameters section for more details.
note
In documentation, we will often use the terms "LLM" and "Chat Model" interchangeably. This is because most modern LLMs are exposed to users via a chat model interface.
However, LangChain also has implementations of older LLMs that do not follow the chat model interface and instead use an interface that takes a string as input and returns a string as output. These models are typically named without the "Chat" prefix (e.g., Ollama
, Anthropic
, OpenAI
, etc.). These models implement the BaseLLM interface and may be named with the "LLM" suffix (e.g., OllamaLLM
, AnthropicLLM
, OpenAILLM
, etc.). Generally, users should not use these models.
The key methods of a chat model are:
invoke
method for models that natively support structured output.Other important methods can be found in the BaseChatModel API Reference.
Inputs and outputsβModern LLMs are typically accessed through a chat model interface that takes messages as input and returns messages as output. Messages are typically associated with a role (e.g., "system", "human", "assistant") and one or more content blocks that contain text or potentially multimodal data (e.g., images, audio, video).
LangChain supports two message formats to interact with chat models:
Many chat models have standardized parameters that can be used to configure the model:
Parameter Descriptionmodel
The name or identifier of the specific AI model you want to use (e.g., "gpt-3.5-turbo"
or "gpt-4"
). temperature
Controls the randomness of the model's output. A higher value (e.g., 1.0) makes responses more creative, while a lower value (e.g., 0.0) makes them more deterministic and focused. timeout
The maximum time (in seconds) to wait for a response from the model before canceling the request. Ensures the request doesnβt hang indefinitely. max_tokens
Limits the total number of tokens (words and punctuation) in the response. This controls how long the output can be. stop
Specifies stop sequences that indicate when the model should stop generating tokens. For example, you might use specific strings to signal the end of a response. max_retries
The maximum number of attempts the system will make to resend a request if it fails due to issues like network timeouts or rate limits. api_key
The API key required for authenticating with the model provider. This is usually issued when you sign up for access to the model. base_url
The URL of the API endpoint where requests are sent. This is typically provided by the model's provider and is necessary for directing your requests. rate_limiter
An optional BaseRateLimiter to space out requests to avoid exceeding rate limits. See rate-limiting below for more details.
Some important things to note:
langchain-openai
, langchain-anthropic
, etc.), they're not enforced on models in langchain-community
.Chat models also accept other parameters that are specific to that integration. To find all the parameters supported by a Chat model head to the their respective API reference for that model.
Chat models can call tools to perform tasks such as fetching data from a database, making API requests, or running custom code. Please see the tool calling guide for more information.
Structured outputsβChat models can be requested to respond in a particular format (e.g., JSON or matching a particular schema). This feature is extremely useful for information extraction tasks. Please read more about the technique in the structured outputs guide.
MultimodalityβLarge Language Models (LLMs) are not limited to processing text. They can also be used to process other types of data, such as images, audio, and video. This is known as multimodality.
Currently, only some LLMs support multimodal inputs, and almost none support multimodal outputs. Please consult the specific model documentation for details.
Context windowβA chat model's context window refers to the maximum size of the input sequence the model can process at one time. While the context windows of modern LLMs are quite large, they still present a limitation that developers must keep in mind when working with chat models.
If the input exceeds the context window, the model may not be able to process the entire input and could raise an error. In conversational applications, this is especially important because the context window determines how much information the model can "remember" throughout a conversation. Developers often need to manage the input within the context window to maintain a coherent dialogue without exceeding the limit. For more details on handling memory in conversations, refer to the memory.
The size of the input is measured in tokens which are the unit of processing that the model uses.
Advanced topicsβ Rate-limitingβMany chat model providers impose a limit on the number of requests that can be made in a given time period.
If you hit a rate limit, you will typically receive a rate limit error response from the provider, and will need to wait before making more requests.
You have a few options to deal with rate limits:
rate_limiter
parameter that can be provided during initialization. This parameter is used to control the rate at which requests are made to the model provider. Spacing out the requests to a given model is a particularly useful strategy when benchmarking models to evaluate their performance. Please see the how to handle rate limits for more information on how to use this feature.max_retries
parameter that can be used to control the number of retries. See the standard parameters section for more information.Chat model APIs can be slow, so a natural question is whether to cache the results of previous conversations. Theoretically, caching can help improve performance by reducing the number of requests made to the model provider. In practice, caching chat model responses is a complex problem and should be approached with caution.
The reason is that getting a cache hit is unlikely after the first or second interaction in a conversation if relying on caching the exact inputs into the model. For example, how likely do you think that multiple conversations start with the exact same message? What about the exact same three messages?
An alternative approach is to use semantic caching, where you cache responses based on the meaning of the input rather than the exact input itself. This can be effective in some situations, but not in others.
A semantic cache introduces a dependency on another model on the critical path of your application (e.g., the semantic cache may rely on an embedding model to convert text to a vector representation), and it's not guaranteed to capture the meaning of the input accurately.
However, there might be situations where caching chat model responses is beneficial. For example, if you have a chat model that is used to answer frequently asked questions, caching responses can help reduce the load on the model provider, costs, and improve response times.
Please see the how to cache chat model responses guide for more details.
RetroSearch is an open source project built by @garambo | Open a GitHub Issue
Search and Browse the WWW like it's 1997 | Search results from DuckDuckGo
HTML:
3.2
| Encoding:
UTF-8
| Version:
0.7.4