
Showing content from https://aws.amazon.com/what-is/transformers-in-artificial-intelligence/ below:

What are Transformers? - Transformers in Artificial Intelligence Explained

The transformer neural network architecture has several software layers that work together to generate the final output. The following image shows the components of the transformer architecture, as explained in the rest of this section.

Input embeddings

This stage converts the input sequence into the mathematical domain that software algorithms understand. At first, the input sequence is broken down into a series of tokens or individual sequence components. For instance, if the input is a sentence, the tokens can be words or pieces of words. Embedding then transforms the token sequence into a sequence of mathematical vectors. The vectors carry semantic and syntactic information, represented as numbers, and their attributes are learned during the training process.
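
The two steps above (tokenize, then look up a vector per token) can be sketched as follows. This is a minimal illustration, not how a production model does it: the whitespace tokenizer, the tiny vocabulary, and the randomly initialized table are all assumptions, and the helper names are hypothetical. In a real model the table values are learned during training.

```python
import random

def tokenize(sentence):
    # Assumption: whitespace splitting. Real tokenizers often split into subwords.
    return sentence.lower().split()

def build_vocab(tokens):
    # Assign each distinct token an integer id in order of first appearance.
    vocab = {}
    for tok in tokens:
        if tok not in vocab:
            vocab[tok] = len(vocab)
    return vocab

def make_embedding_table(vocab_size, dim, seed=0):
    # Random placeholder values; training would adjust these numbers.
    rng = random.Random(seed)
    return [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(vocab_size)]

tokens = tokenize("Speak no lies")
vocab = build_vocab(tokens)
table = make_embedding_table(len(vocab), dim=4)

token_ids = [vocab[t] for t in tokens]   # tokens -> integer ids
embedded = [table[i] for i in token_ids] # ids -> one vector per token
```

Each token ends up as a row of `dim` numbers, which is the form the later stages operate on.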

You can visualize vectors as a series of coordinates in an n-dimensional space. As a simple example, think of a two-dimensional graph, where x represents the alphabetical position of the first letter of the word and y represents its category. The word banana has the value (2,2) because it starts with the letter b and is in the category fruit. The word mango has the value (13,2) because it starts with the letter m and is also in the category fruit. In this way, the vector (x,y) tells the neural network that the words banana and mango are in the same category.
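
The banana and mango example can be reproduced in a few lines. The category id of 2 for fruit comes from the example above; the `CATEGORY_IDS` mapping and the `toy_vector` helper are hypothetical names used only for illustration.

```python
# Hypothetical category ids; only "fruit" = 2 is taken from the example in the text.
CATEGORY_IDS = {"fruit": 2}

def toy_vector(word, category):
    # x: alphabetical position of the first letter (a = 1, b = 2, ...).
    x = ord(word[0].lower()) - ord("a") + 1
    # y: numeric id of the word's category.
    y = CATEGORY_IDS[category]
    return (x, y)

banana = toy_vector("banana", "fruit")  # (2, 2)
mango = toy_vector("mango", "fruit")    # (13, 2)

# Equal y coordinates mean the two words share a category.
same_category = banana[1] == mango[1]
```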

Now imagine an n-dimensional space with thousands of attributes about any word's grammar, meaning, and use in sentences mapped to a series of numbers. Software can use the numbers to calculate the relationships between words in mathematical terms and build a model of human language. Embeddings provide a way to represent discrete tokens as continuous vectors that the model can process and learn from.
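
One common way to calculate a relationship between two word vectors is cosine similarity, which measures how closely they point in the same direction. The sketch below uses made-up three-dimensional vectors purely for illustration; learned embeddings have far more dimensions and their values come from training.

```python
import math

def cosine_similarity(a, b):
    # Dot product divided by the product of the vector lengths.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Made-up vectors, chosen so the two fruits point in a similar direction.
embeddings = {
    "banana": [0.9, 0.8, 0.1],
    "mango": [0.8, 0.9, 0.2],
    "hammer": [0.1, 0.2, 0.9],
}

fruit_sim = cosine_similarity(embeddings["banana"], embeddings["mango"])
tool_sim = cosine_similarity(embeddings["banana"], embeddings["hammer"])
```

With these values, `fruit_sim` is much closer to 1 than `tool_sim`, which is how the numbers express that banana is more related to mango than to hammer.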

Positional encoding

Positional encoding is a crucial component in the transformer architecture because the model itself doesn’t inherently process sequential data in order. The transformer needs a way to consider the order of the tokens in the input sequence. Positional encoding adds information to each token's embedding to indicate its position in the sequence. This is often done by using a set of functions that generate a unique positional signal that is added to the embedding of each token. With positional encoding, the model can preserve the order of the tokens and understand the sequence context.
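
As one concrete choice for that "set of functions," the sketch below uses the sine and cosine scheme from the original transformer design. This is an assumption for illustration: the text above does not name a specific scheme, and many models use learned or other positional signals instead.

```python
import math

def positional_encoding(seq_len, dim):
    # Even dimensions use sine, odd dimensions use cosine, and each pair
    # of dimensions shares one wavelength.
    table = []
    for pos in range(seq_len):
        row = []
        for i in range(dim):
            angle = pos / (10000 ** ((2 * (i // 2)) / dim))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        table.append(row)
    return table

def add_position(embeddings):
    # Add the positional signal element-wise to each token embedding.
    pe = positional_encoding(len(embeddings), len(embeddings[0]))
    return [[e + p for e, p in zip(emb_row, pe_row)]
            for emb_row, pe_row in zip(embeddings, pe)]

# The same token vector at two positions receives two different signals.
same_token_twice = [[0.5, 0.5, 0.5, 0.5], [0.5, 0.5, 0.5, 0.5]]
with_position = add_position(same_token_twice)
```

After the addition, the two rows of `with_position` differ even though the token embedding was identical, which is what lets later layers tell positions apart.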

Transformer block

A typical transformer model has multiple transformer blocks stacked together. Each transformer block has two main components: a multi-head self-attention mechanism and a position-wise feed-forward neural network. The self-attention mechanism enables the model to weigh the importance of different tokens within the sequence. It focuses on relevant parts of the input when making predictions.

For instance, consider the sentences "Speak no lies" and "He lies down." In both sentences, the meaning of the word lies can’t be understood without looking at the words next to it. The words speak and down are essential to understand the correct meaning. Self-attention enables the grouping of relevant tokens for context.
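
The weighing of tokens against each other can be sketched as scaled dot-product attention with a single head. This is a simplified illustration under stated assumptions: the projection matrices `w_q`, `w_k`, and `w_v` are passed in by hand (identity matrices in the usage below), whereas a real model learns them and runs several heads in parallel.

```python
import math

def matmul(a, b):
    # Plain-Python matrix product of a (n x m) and b (m x p).
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    # Subtract the max for numerical stability before exponentiating.
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(x, w_q, w_k, w_v):
    # Project every token vector into query, key, and value vectors.
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    d_k = len(k[0])
    k_t = [list(col) for col in zip(*k)]
    # Score every token against every other token, then scale.
    scores = matmul(q, k_t)
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    # Each output row is a weighted mix of all value vectors.
    return matmul(weights, v), weights

# Three toy token vectors and identity projections (illustrative values only).
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
output, weights = self_attention(x, identity, identity, identity)
```

Each row of `weights` sums to 1 and says how much one token draws on every token in the sequence, itself included.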

Each transformer block also has additional components that help the model train and function more efficiently. For example, each transformer block includes:

Connections around the two main components that act like shortcuts (residual connections). They enable the flow of information from one part of the network to another, skipping certain operations in between.
Layer normalization that keeps the numbers, specifically the outputs of different layers in the network, inside a certain range so that the model trains smoothly.
Linear transformation functions so that the model adjusts values to better perform the task it's being trained on, such as document summarization as opposed to translation.
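
The three pieces listed above can be sketched for a single token vector. This is a simplified illustration: the layer normalization here omits the learned scale and shift parameters, the weights are hand-picked rather than trained, and the ordering (add the shortcut, then normalize) is one common arrangement assumed for the example.

```python
import math

def layer_norm(vec, eps=1e-5):
    # Rescale the vector to roughly zero mean and unit variance.
    mean = sum(vec) / len(vec)
    var = sum((v - mean) ** 2 for v in vec) / len(vec)
    return [(v - mean) / math.sqrt(var + eps) for v in vec]

def linear(vec, weights, bias):
    # weights has shape (len(vec), out_dim); returns a vector of length out_dim.
    return [sum(v * w for v, w in zip(vec, col)) + b
            for col, b in zip(zip(*weights), bias)]

def feed_forward(vec, w1, b1, w2, b2):
    # Two linear transformations with a ReLU nonlinearity in between.
    hidden = [max(0.0, h) for h in linear(vec, w1, b1)]
    return linear(hidden, w2, b2)

def residual_then_norm(vec, sublayer_out):
    # Shortcut connection: add the input back, then normalize the sum.
    return layer_norm([a + b for a, b in zip(vec, sublayer_out)])

# Hand-picked illustrative values, not trained weights.
token = [1.0, 2.0, 3.0, 4.0]
w1 = [[0.1, -0.2], [0.3, 0.4], [-0.5, 0.6], [0.7, -0.8]]
b1 = [0.0, 0.0]
w2 = [[0.5, -0.5, 0.25, 0.0], [0.1, 0.2, 0.3, 0.4]]
b2 = [0.0, 0.0, 0.0, 0.0]

block_out = residual_then_norm(token, feed_forward(token, w1, b1, w2, b2))
```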

Linear and softmax blocks

Ultimately the model needs to make a concrete prediction, such as choosing the next word in a sequence. This is where the linear block comes in. It’s another fully connected layer, also known as a dense layer, before the final stage. It performs a learned linear mapping from the model's internal vector space back to the original input domain, such as the vocabulary of possible tokens. This layer takes the complex internal representations and turns them back into specific predictions that you can interpret and use. The output of this layer is a set of scores (often called logits), one for each possible token.

The softmax function is the final stage that takes the logit scores and normalizes them into a probability distribution. Each element of the softmax output represents the model's confidence in a particular class or token.
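
The last two stages can be sketched together: a linear projection produces one logit per vocabulary entry, and softmax turns the logits into probabilities. The three-word vocabulary, the hidden vector, and the weights below are made-up values for illustration only.

```python
import math

def project_to_logits(hidden, w_out, b_out):
    # w_out has shape (len(hidden), vocab_size); one score per vocabulary entry.
    return [sum(h * w for h, w in zip(hidden, col)) + b
            for col, b in zip(zip(*w_out), b_out)]

def softmax(logits):
    # Subtract the max for numerical stability, then normalize to sum to 1.
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Made-up values: a 2-dimensional hidden state and a 3-token vocabulary.
vocab = ["speak", "no", "lies"]
hidden = [0.5, -1.0]
w_out = [[1.0, 0.0, 2.0],
         [0.0, 1.0, -1.0]]
b_out = [0.0, 0.0, 0.0]

logits = project_to_logits(hidden, w_out, b_out)  # [0.5, -1.0, 2.0]
probs = softmax(logits)
predicted = vocab[probs.index(max(probs))]        # "lies"
```

Picking the highest-probability token, as done here, is only one option; the probabilities can also be sampled from.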
