RetroSearch Browse

Home - News ( United States | United Kingdom | Italy | Germany ) - Football scores

Showing content from https://deeplearning4j.konduit.ai/deeplearning4j/tutorials/language-processing/tokenization below:

Tokenization | Deeplearning4j

Tokenization

Breaking text into individual words for language processing in DL4J.

Notes to write on: 1. Tokenizer factory interface 2. Tokenizer interface 2. How to write your own factory and tokenizer

Tokenization is the process of breaking text down into individual words. Word windows are also composed of tokens. Word2Vec can output text windows that comprise training examples for input into neural nets, as seen here.

Here's an example of tokenization done with DL4J tools:

TokenizerFactory tokenizerFactory = new DefaultTokenizerFactory();
Tokenizer tokenizer = tokenizerFactory.tokenize("mystring");

//iterate over the tokens
while(tokenizer.hasMoreTokens()) {
      String token = tokenizer.nextToken();
}

//get the whole list of tokens
List<String> tokens = tokenizer.getTokens();

The above snippet creates a tokenizer capable of stemming.

In Word2Vec, that's the recommended a way of creating a vocabulary, because it averts various vocabulary quirks, such as the singular and plural of the same noun being counted as two different words.

RetroSearch is an open source project built by @garambo | Open a GitHub Issue

Search and Browse the WWW like it's 1997 | Search results from DuckDuckGo

HTML: 3.2 | Encoding: UTF-8 | Version: 0.7.4