Last Updated : 23 Jul, 2025
In natural language processing, the order of words is essential for understanding meaning in tasks like translation and text generation. Transformers process all tokens in parallel, which speeds up training, but this means they do not naturally capture the order of tokens. Positional encoding addresses this by adding information about each token's position in the sequence, helping the model understand the order of, and relationships between, tokens. In this article, we will look at positional encoding and its core concepts.
Example of Positional Encoding
Suppose we have a Transformer model that translates English sentences into French, and consider the sentence:
"The cat sat on the mat."
Before the sentence is fed into the Transformer model, it is tokenized: each word is converted into a token. Let's assume the tokens for this sentence are:
["The", "cat" , "sat", "on", "the" ,"mat"]
After that, each token is mapped to a high-dimensional vector through an embedding layer. These embeddings encode semantic information about the words in the sentence; however, they carry no information about the order of the words.
\text{Embeddings} = \{ E_1, E_2, E_3, E_4, E_5, E_6 \}
where each E_i is a 4-dimensional vector. This is where positional encoding plays an important role: positional encodings are added to the word embeddings, assigning each token a unique representation based on its position in the sequence and helping the model understand word order.
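As a small illustration of this addition step, here is a minimal NumPy sketch assuming toy random 4-dimensional embeddings for the six tokens; the vocabulary and the random values are purely illustrative and not part of the article's model. It computes sinusoidal positional encodings for positions 1 to 6 (using the formula given below) and adds them to the embeddings element-wise.
import numpy as np

tokens = ["The", "cat", "sat", "on", "the", "mat"]
d_model = 4

# Toy embedding table: one random 4-dimensional vector per lower-cased word.
vocab = {"the": 0, "cat": 1, "sat": 2, "on": 3, "mat": 4}
rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(len(vocab), d_model))

# E1..E6: the two occurrences of "the" get identical vectors -- no order info yet.
embeddings = np.stack([embedding_table[vocab[t.lower()]] for t in tokens])

# Sinusoidal positional encodings for positions 1..6 (formula shown below).
positions = np.arange(1, len(tokens) + 1)[:, np.newaxis]
dims = np.arange(d_model)[np.newaxis, :]
angles = positions / np.power(10000, (2 * (dims // 2)) / np.float32(d_model))
pos_encoding = np.where(dims % 2 == 0, np.sin(angles), np.cos(angles))

# The model's actual input is the element-wise sum: embedding + position.
model_input = embeddings + pos_encoding
print(model_input.shape)   # (6, 4)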
Calculating Positional Encodings
For the six tokens above, using a 4-dimensional encoding (d_{\text{model}} = 4), the positional encodings are:
\text{PE}(1) = \left[\sin\left(\frac{1}{10000^{2 \times 0/4}}\right), \cos\left(\frac{1}{10000^{2 \times 0/4}}\right), \sin\left(\frac{1}{10000^{2 \times 1/4}}\right), \cos\left(\frac{1}{10000^{2 \times 1/4}}\right)\right]
\text{PE}(2) = \left[\sin\left(\frac{2}{10000^{2 \times 0/4}}\right), \cos\left(\frac{2}{10000^{2 \times 0/4}}\right), \sin\left(\frac{2}{10000^{2 \times 1/4}}\right), \cos\left(\frac{2}{10000^{2 \times 1/4}}\right)\right]
\dots
\text{PE}(6) = \left[\sin\left(\frac{6}{10000^{2 \times 0/4}}\right), \cos\left(\frac{6}{10000^{2 \times 0/4}}\right), \sin\left(\frac{6}{10000^{2 \times 1/4}}\right), \cos\left(\frac{6}{10000^{2 \times 1/4}}\right)\right]
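As a quick check, evaluating the first of these expressions numerically (with the 4-dimensional setup above) gives:
\text{PE}(1) = \left[\sin(1), \cos(1), \sin(0.01), \cos(0.01)\right] \approx \left[0.8415,\ 0.5403,\ 0.0100,\ 0.99995\right]
so the first pair of dimensions varies much faster with position than the second pair.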
The positional encoding layer is important in a Transformer because it supplies positional information to the model. Since Transformers process sequences in parallel and have no built-in notion of token order, this layer helps the model capture the sequence's structure.
Using a mathematical formula, it generates a unique representation for each token's position in the sequence, allowing the Transformer to understand token order while still processing tokens in parallel.
Formula for Positional Encoding: For each position p in the sequence and for each dimension pair 2i and 2i+1 of the encoding vector, with model dimensionality d_{\text{model}}:
\text{PE}(p, 2i) = \sin\left(\frac{p}{10000^{2i/d_{\text{model}}}}\right) \\ \text{PE}(p, 2i+1) = \cos\left(\frac{p}{10000^{2i/d_{\text{model}}}}\right)
These formulas use sine and cosine functions to create wave-like patterns that change across sequence positions. Using sine for even indices and cosine for odd indices produces a combination of features at different wavelengths that can represent positional information effectively across different sequence lengths.
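To make the different wavelengths concrete, here is a short illustrative sketch with an assumed d_model of 8 (not taken from the article): the first dimension oscillates noticeably from position to position, while the last sine dimension barely changes over a handful of positions.
import numpy as np

d_model = 8
positions = np.arange(6)[:, np.newaxis]       # positions 0..5
dims = np.arange(d_model)[np.newaxis, :]
angles = positions / np.power(10000, (2 * (dims // 2)) / np.float32(d_model))
pe = np.where(dims % 2 == 0, np.sin(angles), np.cos(angles))

print(np.round(pe[:, 0], 3))   # dimension 0: wavelength 2*pi -- changes quickly
print(np.round(pe[:, 6], 3))   # dimension 6: wavelength ~2*pi*1000 -- nearly constant here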
Implementation of Positional Encoding in Transformers
Here we will use the NumPy and TensorFlow libraries for the implementation.
import numpy as np
import tensorflow as tf

def positional_encoding(position, d_model):
    # Angle for each (position, dimension) pair: pos / 10000^(2i / d_model),
    # where i = dimension // 2 so that each sine/cosine pair shares a frequency.
    angle_rads = np.arange(position)[:, np.newaxis] / np.power(
        10000, (2 * (np.arange(d_model)[np.newaxis, :] // 2)) / np.float32(d_model))
    # Sine on even dimensions, cosine on odd dimensions.
    angle_rads[:, 0::2] = np.sin(angle_rads[:, 0::2])
    angle_rads[:, 1::2] = np.cos(angle_rads[:, 1::2])
    # Add a leading batch dimension: (1, position, d_model).
    pos_encoding = angle_rads[np.newaxis, ...]
    return tf.cast(pos_encoding, dtype=tf.float32)

position = 50
d_model = 512
pos_encoding = positional_encoding(position, d_model)
print(pos_encoding.shape)
Output:
(1, 50, 512)
The array produced here is the positional encoding generated by the positional_encoding function for a sequence of length 50 and a model dimensionality of 512, with a leading batch dimension of 1. Each row corresponds to a position in the sequence and each column to a dimension of the model.
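As a usage sketch, the positional encoding computed above can be added to a batch of token embeddings before they enter the Transformer. The vocabulary size, batch size and the tf.keras.layers.Embedding layer below are illustrative assumptions, not part of the article's code; pos_encoding and d_model are reused from the code above, and broadcasting adds the same positional pattern to every sequence in the batch.
import tensorflow as tf

# Assumed toy setup: vocabulary of 8000 tokens, batch of 2 sequences of length 50.
vocab_size, batch_size, seq_len = 8000, 2, 50
embedding_layer = tf.keras.layers.Embedding(vocab_size, d_model)   # d_model = 512 from above

token_ids = tf.random.uniform((batch_size, seq_len), maxval=vocab_size, dtype=tf.int32)
token_embeddings = embedding_layer(token_ids)                      # shape: (2, 50, 512)

# pos_encoding has shape (1, 50, 512); the leading 1 broadcasts over the batch.
encoder_input = token_embeddings + pos_encoding[:, :seq_len, :]
print(encoder_input.shape)                                         # (2, 50, 512)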
Importance of Positional Encoding
By adding positional information to the token embeddings, positional encodings allow Transformer models to understand the order of tokens and the relationships between them, so the model can handle sequential data while still processing it in parallel.