NetEncoder["Tokens"]
represents an encoder that converts the words in a string to a sequence of integer codes using a standard English vocabulary.
NetEncoder[{"Tokens","language"}]
represents an encoder that uses a standard vocabulary for the given language.
NetEncoder[{"Tokens",{token1,token2,…}}]
represents an encoder that uses a specified list of tokens as the vocabulary.
NetEncoder[{"Tokens",…,"param"->value}]
represents an encoder in which additional parameters have been specified.
Details
Create a token encoder for English text:
Out-of-vocabulary words are encoded as the maximum code:
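A minimal sketch of these two steps. The input strings are illustrative, and `zzyzzx` is a made-up word assumed to be absent from the standard vocabulary:

```wolfram
(* Create the default English token encoder *)
enc = NetEncoder["Tokens"];

(* Encode a sentence as a list of integer codes, one per word *)
enc["the quick brown fox"]

(* The last word is assumed to be out of vocabulary, so it should be
   encoded as the maximum code *)
enc["the quick brown zzyzzx"]
```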
By default, words are detected using a simple regular expression:
The list of words can be explicitly passed using TextElement:
Scope (6)
Use the default token encoder to encode a sentence:
Give a specific list of tokens:
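As a sketch, an encoder with an explicit vocabulary might be built as follows. The three tokens and the input string are illustrative, and the comment about code assignment is an assumption rather than something stated on this page:

```wolfram
(* Encoder whose vocabulary is exactly these three tokens;
   codes are assumed to follow the order of the list *)
enc = NetEncoder[{"Tokens", {"cat", "dog", "bird"}}];

(* Each word of the input is looked up in the custom vocabulary *)
enc["dog bird cat dog"]
```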
Give a specific list of tokens, including a split pattern:
Specify that the sequence should be padded or trimmed to be 4 elements long:
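A sketch of fixed-length encoding. The parameter name `"TargetLength"` does not appear in the text above and is an assumption here; the vocabulary and inputs are illustrative:

```wolfram
(* "TargetLength" is assumed to be the parameter that fixes the output length *)
enc = NetEncoder[{"Tokens", {"cat", "dog", "bird"}, "TargetLength" -> 4}];

(* Fewer than 4 words: the result should be padded to 4 codes *)
enc["cat dog"]

(* More than 4 words: the result should be trimmed to 4 codes *)
enc["cat dog bird cat dog"]
```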
Use a built-in dictionary for a specific language:
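For instance, following the `NetEncoder[{"Tokens","language"}]` form, a French encoder might look like this. The choice of `"French"` and the sample sentence are illustrative, and the availability of that vocabulary is assumed:

```wolfram
(* Use the standard vocabulary for French instead of English *)
enc = NetEncoder[{"Tokens", "French"}];

(* Encode a French sentence as a list of integer codes *)
enc["le chat est sur la table"]
```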
Use a custom tokenization with TextElement:
Use the output of TextStructure to compute a list of token indices:
A tree structure gets flattened:
Parameters (3)
"IgnoreCase" (1)
An encoder with "IgnoreCase"->True treats tokens that differ only by the case of their constituent characters as equivalent:
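A small sketch of the case-insensitive setting, with an illustrative two-token vocabulary and input:

```wolfram
(* Case-insensitive encoder over a two-token vocabulary *)
enc = NetEncoder[{"Tokens", {"cat", "dog"}, "IgnoreCase" -> True}];

(* "Cat" and "cat" should receive the same code, as should "DOG" and "dog" *)
enc["Cat DOG cat dog"]
```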
An encoder with "IgnoreCase"->False does not do this:
"SplitPattern" (2)
Create an encoder that isolates digit characters, using "SplitPattern":
The encoder outputs one token for each digit character:
It is different from the default behavior, which gathers all consecutive digit characters together:
Create an encoder with "SplitPattern"->None and two tokens:
The encoder now expects a list of tokens as input:
The encoder still maps across a batch of examples:
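The last three steps can be sketched as follows. The token names `"a"` and `"b"` and the inputs are illustrative:

```wolfram
(* With "SplitPattern" -> None the encoder does no splitting of its own *)
enc = NetEncoder[{"Tokens", {"a", "b"}, "SplitPattern" -> None}];

(* A single example is given as a list of tokens *)
enc[{"a", "b", "a"}]

(* A batch is a list of such token lists; each one is encoded separately *)
enc[{{"a", "b"}, {"b", "a", "a"}}]
```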