Showing content from https://cloud.google.com/vertex-ai/docs/text-data/entity-extraction/prepare-data below:

Prepare text training data for entity extraction | Vertex AI

Starting on September 15, 2024, you can only customize classification, entity extraction, and sentiment analysis objectives by moving to Vertex AI Gemini prompts and tuning. Training or updating models for Vertex AI AutoML for Text classification, entity extraction, and sentiment analysis objectives will no longer be available. You can continue using existing Vertex AI AutoML Text models until June 15, 2025. For a comparison of AutoML text and Gemini, see Gemini for AutoML text users. For more information about how Gemini offers enhanced user experience through improved prompting capabilities, see Introduction to tuning. To get started with tuning, see Model tuning for Gemini text models.

This page describes how to prepare text data for use in a Vertex AI dataset to train an entity extraction model.

Entity extraction training data consists of documents annotated with labels for the types of entities that you want your model to identify. For example, you might create an entity extraction model to identify specialized terminology in legal documents or patents. Each annotation specifies the location of an entity in the text and the label assigned to it.

If you're annotating structured or semi-structured documents, such as invoices or contracts, for a dataset used to train AutoML models, Vertex AI can consider an annotation's position on the page as a factor in determining its label. For example, a real estate contract has both an acceptance date and a closing date. Vertex AI can learn to distinguish between the two entities based on the spatial position of each annotation.

Data requirements

You must supply at least 50, and no more than 100,000, training documents.
You must supply at least 1, and no more than 100, unique labels to annotate entities that you want to extract.
You can use a label to annotate between 1 and 10 words.
Label names can be between 2 and 30 characters.
You can include annotations in your JSON Lines files, or you can add annotations later by using the Google Cloud console after uploading documents.
You can include documents inline or reference TXT files that are in Cloud Storage buckets.
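
The limits above can be checked before upload. The following is a minimal sketch, not an official validator: the function name check_requirements is hypothetical, it only inspects inline documents (textContent), and it assumes that offsets are character indexes with an exclusive endOffset. Confirm the offset convention against the schema file before relying on it.

```python
import json

# Limits taken from the data requirements listed above.
MIN_DOCS, MAX_DOCS = 50, 100_000
MAX_LABELS = 100
MIN_LABEL_CHARS, MAX_LABEL_CHARS = 2, 30
MIN_WORDS, MAX_WORDS = 1, 10


def check_requirements(jsonl_lines):
    """Return a list of human-readable problems found in the dataset."""
    problems = []
    labels = set()
    docs = [json.loads(line) for line in jsonl_lines if line.strip()]

    if not MIN_DOCS <= len(docs) <= MAX_DOCS:
        problems.append(f"document count {len(docs)} outside {MIN_DOCS}-{MAX_DOCS}")

    for i, doc in enumerate(docs):
        text = doc.get("textContent")  # None when the record uses textGcsUri
        for ann in doc.get("textSegmentAnnotations", []):
            name = ann["displayName"]
            labels.add(name)
            if not MIN_LABEL_CHARS <= len(name) <= MAX_LABEL_CHARS:
                problems.append(f"doc {i}: label name {name!r} has invalid length")
            if text is not None:
                # Assumption: offsets are character indexes, end-exclusive.
                span = text[ann["startOffset"]:ann["endOffset"]]
                words = len(span.split())
                if not MIN_WORDS <= words <= MAX_WORDS:
                    problems.append(f"doc {i}: span {span!r} covers {words} words")

    # Annotations may be added later in the console, so an unannotated
    # dataset is not reported as having too few labels.
    if len(labels) > MAX_LABELS:
        problems.append(f"{len(labels)} unique labels exceeds {MAX_LABELS}")
    return problems
```

An empty list means that none of the checked limits were violated; it does not guarantee that Vertex AI accepts the file.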

Best practices for text data used to train AutoML models

The following recommendations apply to datasets used to train AutoML models.

Use each label at least 200 times in your training dataset.
Annotate every occurrence of entities that you want your model to identify.
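
One way to check the first recommendation is to count how often each label appears across the JSON Lines file. This is an illustrative sketch; the function name underused_labels is hypothetical and not part of any Vertex AI library.

```python
import json
from collections import Counter

RECOMMENDED_MIN_USES = 200  # from the best practice above


def underused_labels(jsonl_lines, minimum=RECOMMENDED_MIN_USES):
    """Map each label used fewer than `minimum` times to its count."""
    counts = Counter()
    for line in jsonl_lines:
        if not line.strip():
            continue  # skip blank lines between records
        doc = json.loads(line)
        for ann in doc.get("textSegmentAnnotations", []):
            counts[ann["displayName"]] += 1
    return {label: n for label, n in counts.items() if n < minimum}
```

Labels returned by this function are candidates for more annotation before training.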

Input files

Input files for entity extraction must be in JSON Lines format. The format, field names, and value types for JSON Lines files are determined by a schema file, which is a publicly accessible YAML file.

You can download the schema file for entity extraction from the following Cloud Storage location:
gs://google-cloud-aiplatform/schema/dataset/ioformat/text_extraction_io_format_1.0.0.yaml

The following example shows how you might use the schema to create your own JSON Lines file. The first record includes the document text inline (textContent); the second references a file in Cloud Storage (textGcsUri). The example includes line breaks for readability. In your JSON Lines files, include a line break only after each document. The dataItemResourceLabels field is optional; it can specify, for example, ml_use.

{
    "textSegmentAnnotations": [
      {
        "startOffset":number,
        "endOffset":number,
        "displayName": "label"
      },
      ...
    ],
    "textContent": "inline_text",
    "dataItemResourceLabels": {
      "aiplatform.googleapis.com/ml_use": "training|test|validation"
    }
}
{
    "textSegmentAnnotations": [
      {
        "startOffset":number,
        "endOffset":number,
        "displayName": "label"
      },
      ...
    ],
    "textGcsUri": "gcs_uri_to_file",
    "dataItemResourceLabels": {
      "aiplatform.googleapis.com/ml_use": "training|test|validation"
    }
}
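
The template above can be filled in programmatically. The following is a minimal sketch for inline documents; make_record and to_jsonl are hypothetical helper names. It assumes that offsets count Python string characters and that endOffset is exclusive, so verify both against the schema file, especially for text that contains non-ASCII characters.

```python
import json


def make_record(text, entities, ml_use=None):
    """Build one dataset record for inline text.

    `entities` is a list of (substring, label) pairs; the first occurrence
    of each substring in `text` is annotated.
    """
    annotations = []
    for substring, label in entities:
        start = text.find(substring)
        if start == -1:
            raise ValueError(f"{substring!r} not found in text")
        annotations.append({
            "startOffset": start,
            # Assumption: endOffset points one past the last character.
            "endOffset": start + len(substring),
            "displayName": label,
        })
    record = {"textSegmentAnnotations": annotations, "textContent": text}
    if ml_use is not None:  # dataItemResourceLabels is optional
        record["dataItemResourceLabels"] = {
            "aiplatform.googleapis.com/ml_use": ml_use
        }
    return record


def to_jsonl(records):
    # json.dumps escapes newlines inside strings, so each record stays on
    # one line and line breaks appear only after each document.
    return "\n".join(json.dumps(r) for r in records) + "\n"
```

For example, make_record("Closing date is May 1", [("May 1", "closing_date")], ml_use="training") produces a record that annotates characters 16 through 20 of the text.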

You can also annotate documents by using the Google Cloud console. To do so, create a JSON Lines file with content only (without the textSegmentAnnotations field); the documents are then uploaded to Vertex AI without any annotations.
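
If you already have annotated records and want a content-only file for labeling in the console, one approach is to drop the annotation field from each line. This is a sketch; strip_annotations is a hypothetical helper name.

```python
import json


def strip_annotations(jsonl_line):
    """Return the same record as one JSON line, without annotations."""
    doc = json.loads(jsonl_line)
    # Remove the field if present; other fields are kept unchanged.
    doc.pop("textSegmentAnnotations", None)
    return json.dumps(doc)
```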

Except as otherwise noted, the content of this page is licensed under the Creative Commons Attribution 4.0 License, and code samples are licensed under the Apache 2.0 License. For details, see the Google Developers Site Policies. Java is a registered trademark of Oracle and/or its affiliates.

Last updated 2025-08-07 UTC.
