RetroSearch Browse

Home - News ( United States | United Kingdom | Italy | Germany ) - Football scores

Showing content from http://reference.wolfram.com/language/ref/netencoder/Tokens.html below:

Tokens—Wolfram Documentation

WOLFRAM Consulting & Solutions

We deliver solutions for the AI eraâcombining symbolic computation, data-driven insights and deep technical expertise

Data & Computational Intelligence
Model-Based Design
Algorithm Development
Wolfram|Alpha for Business
Blockchain Technology
Education Technology
Quantum Computation

WolframConsulting.com

NetEncoder["Tokens"]

represents an encoder that converts the words in a string to a sequence of integer codes using a standard English vocabulary.

NetEncoder[{"Tokens","language"}]

represents an encoder that uses a standard vocabulary for the given language.

NetEncoder[{"Tokens",{token₁,token₂,…}}]

represents an encoder that uses a specified list of tokens as the vocabulary.

NetEncoder[{"Tokens",…,"param"value}]

represents an encoder in which additional parameters have been specified.

Details

NetEncoder[…][input] applies the encoder to an input to produce an output.
NetEncoder[…][{input₁,input₂,…}] applies the encoder to a list of strings to produce a list of outputs.
The input to the encoder must be a string or a TextElement with a sequence of strings that represents tokens. If it is a string, the segmentation into tokens will be done using a regular expression based on the value of "SplitPattern".
The output of the encoder is a sequence of integers between 1 and d+1, where d is the number of tokens in the vocabulary. The integer d+1 is used to signify tokens in the input that do not occur in the dictionary.
The type of the output NumericArray is the smallest unsigned integer that can represent all possible output integer values.
An encoder can be attached to an input port of a net by specifying "port"->NetEncoder[…] when constructing the net.
The following parameters can be specified:
"IgnoreCase" True whether to ignore case when matching tokens from the string "SplitPattern" the string pattern to use in order to split the input string into tokens "TargetLength" All the length of the final sequence to crop or pad to
With the parameter "IgnoreCase"->True, tokens are effectively converted to lowercase before encoding.
With the parameter "TargetLength"->All, all tokens found in the input string are encoded.
With the parameter "TargetLength"->n, the first n tokens found in the input string are encoded, with padding applied if fewer than n tokens are found. The padding value is d+1, where d is the number of tokens in the vocabulary.
With the parameter "SplitPattern"->None, the input to the encoder is assumed to be a pre-tokenized list of strings of the form {"token₁","token₂",…}.

Examplesopen all close all Basic Examples (1)

Create a token encoder for English text:

Encode an English sentence:

Out-of-vocabulary words are encoded as the maximum code:

By default, words are detected using a simple regular expression:

The list of words can be explicitly passed using TextElement:

Scope (6)

Use the default token encoder to encode a sentence:

Give a specific list of tokens:

Give a specific list of tokens, including a split pattern:

Specify that the sequence should be padded or trimmed to be 4 elements long:

Use a built-in dictionary for a specific language:

Use a custom tokenization with TextElement:

Use the output of TextStructure to compute a list of token indices:

A tree structure gets flattened:

Parameters (3) "IgnoreCase" (1)

An encoder with "IgnoreCase"->True treats tokens that differ only by the case of their constituent characters as equivalent:

An encoder with "IgnoreCase"->False does not do this:

"SplitPattern" (2)

Create an encoder that isolates digit characters, using "SplitPattern":

The encoder outputs one token for each digit character:

It is different from the default behavior, which gathers all consecutive digit characters together:

Create an encoder with "SplitPattern"->None and two tokens:

The encoder now expects a list of tokens as input:

The encoder still maps across a batch of examples:

RetroSearch is an open source project built by @garambo | Open a GitHub Issue

Search and Browse the WWW like it's 1997 | Search results from DuckDuckGo

HTML: 3.2 | Encoding: UTF-8 | Version: 0.7.4