Cloudinary is a cloud-based service that provides solutions for image and video management. These include server- and client-side upload, on-the-fly image and video transformations, fast CDN delivery, and a variety of asset management options.
The Cloudinary AI Vision add-on is a service that combines large language model (LLM) capabilities, specialized models, advanced algorithms, prompt engineering, and Cloudinary's domain knowledge to interpret and respond to visual content queries. It answers questions (e.g., "Are there flowers?") and requests (e.g., "Describe this image") about an image's content. By integrating visual and textual data, AI Vision provides a holistic, adaptable understanding of content, enabling businesses to tailor solutions to their own brand and customer expectations and gain a competitive advantage.
AI Vision caters to a variety of needs across industries, streamlining content moderation, media classification, and content understanding, and automating the analysis, tagging, and moderation of visual content.
AI Vision uses the Analyze API and doesn't require the image to be stored in your Cloudinary account. The AI Vision methods accept either the asset_id of an image in your Cloudinary account, or a valid uri to an image.
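For illustration only, the two source forms might look like this as request fragments (the field names are assumptions to be verified against the Analyze API reference):

```python
# Two interchangeable ways to reference the image to analyze (assumed shapes):
source_by_uri = {"uri": "https://example.com/images/sample.jpg"}  # any reachable image URL
source_by_asset_id = {"asset_id": "<asset_id>"}  # an asset already in your Cloudinary account
```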
Getting started

Before you can use the Cloudinary AI Vision add-on:
You must have a Cloudinary account. If you don't already have one, you can sign up for a free account.
Register for the add-on: make sure you're logged in to your account and then go to the Add-ons page. For more information about add-on registrations, see Registering for add-ons.
Keep in mind that many of the examples on this page use our SDKs. For SDK installation and configuration details, see the relevant SDK guide.
If you're new to Cloudinary, you may want to take a look at the Developer Kickstart for a hands-on, step-by-step introduction to a variety of features.
AI Vision offers scalable solutions for handling large volumes of media assets, providing a ready-to-use experience that you can integrate without complex customization or prompt engineering. The add-on supports the following modes:
The Tagging mode accepts a list of tag names along with their corresponding descriptions. If the image matches a description, which may encompass multiple visual elements, the matching tag is included in the response. This approach lets you align with your own brand taxonomy, offering a dynamic, flexible, and open method for image classification.
To return the tags for an image based on your definitions, call the ai_vision_tagging method with the following parameters, as shown in the sketch after the list:
source: The image to be analyzed. Either a uri or an asset_id can be specified.
tag_definitions: A list of tag definitions containing names and descriptions (max 10).
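As a minimal sketch, here's what a tagging request could look like over raw HTTP. The endpoint path, basic-auth scheme, analysis_type value, and payload field names are assumptions modeled on Analyze API conventions; check the Analyze API reference for the exact contract.

```python
import requests

# Placeholder credentials -- taken from your Cloudinary console.
CLOUD_NAME = "<cloud_name>"
API_KEY = "<api_key>"
API_SECRET = "<api_secret>"

# Assumed Analyze API endpoint; verify the path in the Analyze API reference.
url = f"https://api.cloudinary.com/v2/analysis/{CLOUD_NAME}/analyze"

payload = {
    "analysis_type": "ai_vision_tagging",
    # Either {"uri": ...} or {"asset_id": ...} can be used as the source.
    "source": {"uri": "https://example.com/images/storefront.jpg"},
    "parameters": {
        # Up to 10 definitions; a tag is returned when its description matches.
        "tag_definitions": [
            {"name": "outdoor", "description": "The photo is taken outdoors."},
            {"name": "has_people", "description": "One or more people are visible."},
        ]
    },
}

# Basic auth with the API key and secret is assumed here.
response = requests.post(url, json=payload, auth=(API_KEY, API_SECRET))
response.raise_for_status()
print(response.json())  # tags whose descriptions matched the image
```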
The Moderation mode accepts multiple questions about an image, to which the response provides concise answers of "yes," "no," or "unknown." This allows for a nuanced evaluation of whether the image adheres to specific content policies, creative specs, or aesthetic criteria.
To evaluate images against specific moderation questions, call the ai_vision_moderation method with the following parameters, as shown in the sketch after the list:
source: The image to be analyzed. Either a uri or an asset_id can be specified.
rejection_questions: A list of yes/no questions to ask (max 10).
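A similar hedged sketch for a moderation request, reusing the same assumed endpoint, auth scheme, and payload conventions as the tagging example:

```python
import requests

CLOUD_NAME = "<cloud_name>"  # placeholder credentials from your console
API_KEY = "<api_key>"
API_SECRET = "<api_secret>"

url = f"https://api.cloudinary.com/v2/analysis/{CLOUD_NAME}/analyze"  # assumed endpoint

payload = {
    "analysis_type": "ai_vision_moderation",
    "source": {"asset_id": "<asset_id>"},  # an image already in your account
    "parameters": {
        # Up to 10 questions; each is answered "yes", "no", or "unknown".
        "rejection_questions": [
            "Does the image contain any visible text?",
            "Is anyone in the image smoking?",
        ]
    },
}

response = requests.post(url, json=payload, auth=(API_KEY, API_SECRET))
response.raise_for_status()
print(response.json())  # one concise answer per question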
The General mode serves a wide array of applications by providing detailed answers to diverse questions about an image. Users can inquire about any aspect of an image, such as identifying objects, understanding scenes, or interpreting text within the image.
To ask general questions, call the ai_vision_general method with the following parameters, as shown in the sketch after the list:
source: The image to be analyzed. Either a uri or an asset_id can be specified.
prompts: A list of questions or requests to ask (max 10).
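And a comparable sketch for the general mode, again under the same endpoint and payload assumptions:

```python
import requests

CLOUD_NAME = "<cloud_name>"  # placeholder credentials from your console
API_KEY = "<api_key>"
API_SECRET = "<api_secret>"

url = f"https://api.cloudinary.com/v2/analysis/{CLOUD_NAME}/analyze"  # assumed endpoint

payload = {
    "analysis_type": "ai_vision_general",
    "source": {"uri": "https://example.com/images/menu.jpg"},
    "parameters": {
        # Up to 10 free-form questions or requests.
        "prompts": [
            "Describe this image in one sentence.",
            "What text appears in the image?",
        ]
    },
}

response = requests.post(url, json=payload, auth=(API_KEY, API_SECRET))
response.raise_for_status()
print(response.json())  # a detailed answer per prompt
```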
Your AI Vision add-on quota is based on tokens. A token is a unit of measurement, similar to a word, used to quantify the processing required. Tokens can represent both text and images, with pricing based on the number of tokens processed.
Consolidating text and image usage into a single token count gives a clear picture of the total tokens used.
Every response also includes a limits node with the number of tokens used by the operation.
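For instance, continuing from any of the request sketches above, you could surface token usage like this (the inner structure of the limits node isn't documented here, so treat the field access as illustrative):

```python
# Continuing from any of the requests above:
result = response.json()

# The "limits" node reports the tokens consumed by the operation;
# its inner field names are assumed for illustration only.
limits = result.get("limits", {})
print("Token usage for this request:", limits)
```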