The term dimension is a synonym for the number of elements in a feature vector. Some categorical features are low-dimensional. For example:
| Feature name | # of categories | Sample categories |
|---|---|---|
| snowed_today | 2 | True, False |
| skill_level | 3 | Beginner, Practitioner, Expert |
| season | 4 | Winter, Spring, Summer, Autumn |
| day_of_week | 7 | Monday, Tuesday, Wednesday |
| planet | 8 | Mercury, Venus, Earth |

When a categorical feature has a low number of possible categories, you can encode it as a vocabulary. With a vocabulary encoding, the model treats each possible categorical value as a separate feature. During training, the model learns different weights for each category.
For example, suppose you are creating a model to predict a car's price based, in part, on a categorical feature named car_color. Perhaps red cars are worth more than green cars. Since manufacturers offer a limited number of exterior colors, car_color is a low-dimensional categorical feature. The following illustration suggests a vocabulary (possible values) for car_color:

Figure 1. A vocabulary of possible values for car_color.
True or False: A machine learning model can train directly on raw string values, like "Red" and "Black", without converting these values to numerical vectors.

True: Incorrect. During training, a model can only manipulate floating-point numbers. The string "Red" is not a floating-point number. You must convert strings like "Red" to floating-point numbers.

False: Correct. A machine learning model can only train on features with floating-point values, so you'll need to convert those strings to floating-point values before training.
Index numbers

Machine learning models can only manipulate floating-point numbers. Therefore, you must convert each string to a unique index number, as in the following illustration:
Figure 2. Indexed features.

After converting strings to unique index numbers, you'll need to process the data further to represent it in ways that help the model learn meaningful relationships between the values. If the categorical feature data is left as indexed integers and loaded into a model, the model would treat the indexed values as continuous floating-point numbers. The model would then consider "purple" (index 6) six times more likely than "orange" (index 1).
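To make the mapping concrete, here is a minimal Python sketch of a vocabulary that assigns each color string a unique index number. The car_colors list and its ordering are illustrative assumptions, not part of the course's dataset:

```python
# Hypothetical vocabulary: each color string gets a unique index number.
car_colors = ["Red", "Orange", "Blue", "Yellow", "Green", "Black", "Purple", "Brown"]

# Build the string-to-index mapping.
color_to_index = {color: index for index, color in enumerate(car_colors)}

print(color_to_index["Blue"])    # 2
print(color_to_index["Purple"])  # 6
```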
One-hot encoding

The next step in building a vocabulary is to convert each index number to its one-hot encoding. In a one-hot encoding:

- Each category is represented by a vector (array) of N elements, where N is the number of categories. For example, if car_color has eight possible categories, then the one-hot vector representing it will have eight elements.
- Exactly one of those elements has the value 1.0; all the remaining elements have the value 0.0.

For example, the following table shows the one-hot encoding for each color in car_color:

| Category | One-hot vector |
|---|---|
| "Red" | [1, 0, 0, 0, 0, 0, 0, 0] |
| "Orange" | [0, 1, 0, 0, 0, 0, 0, 0] |
| "Blue" | [0, 0, 1, 0, 0, 0, 0, 0] |
| "Yellow" | [0, 0, 0, 1, 0, 0, 0, 0] |
| "Green" | [0, 0, 0, 0, 1, 0, 0, 0] |
| "Black" | [0, 0, 0, 0, 0, 1, 0, 0] |
| "Purple" | [0, 0, 0, 0, 0, 0, 1, 0] |
| "Brown" | [0, 0, 0, 0, 0, 0, 0, 1] |
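As a rough sketch of how an index number becomes a one-hot vector, the following hypothetical helper (not a library call) builds the vector with NumPy:

```python
import numpy as np

def one_hot(index, num_categories):
    """Return a one-hot vector: 1.0 at `index`, 0.0 everywhere else."""
    vector = np.zeros(num_categories)
    vector[index] = 1.0
    return vector

# "Blue" has index 2 in the eight-color vocabulary.
print(one_hot(2, 8))  # [0. 0. 1. 0. 0. 0. 0. 0.]
```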
It is the one-hot vector, not the string or the index number, that gets passed to the feature vector. The model learns a separate weight for each element of the feature vector.
Note: In a true one-hot encoding, only one element has the value 1.0. In a variant known as multi-hot encoding, multiple values can be 1.0.

The following illustration suggests the various transformations in the vocabulary representation:
Figure 3. The end-to-end process to map categories to feature vectors.

Sparse representation

A feature whose values are predominantly zero (or empty) is termed a sparse feature. Many categorical features, such as car_color, tend to be sparse features. Sparse representation means storing the position of the 1.0 in a sparse vector. For example, the one-hot vector for "Blue" is:
[0, 0, 1, 0, 0, 0, 0, 0]
Since the 1 is in position 2 (when starting the count at 0), the sparse representation for the preceding one-hot vector is:
2
Notice that the sparse representation consumes far less memory than the eight-element one-hot vector. Importantly, the model must train on the one-hot vector, not the sparse representation.
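Here's one way to move between the two representations in Python; the variable names are illustrative:

```python
import numpy as np

one_hot_blue = np.array([0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0])

# Sparse representation: store only the position(s) of the nonzero elements.
sparse_blue = np.flatnonzero(one_hot_blue)  # array([2])

# Expand back to the dense one-hot vector that the model actually trains on.
dense_blue = np.zeros(8)
dense_blue[sparse_blue] = 1.0
```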
Note: The sparse representation of a multi-hot encoding stores the positions of all the nonzero elements. For example, the sparse representation of a car that is both "Blue" and "Black" is 2, 5.
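A quick sketch of the multi-hot case, assuming the same eight-color vocabulary as above:

```python
import numpy as np

# Sparse representation of a car that is both "Blue" (2) and "Black" (5).
sparse_positions = [2, 5]

# Expand to the dense multi-hot vector: two elements are 1.0.
multi_hot = np.zeros(8)
multi_hot[sparse_positions] = 1.0
print(multi_hot)  # [0. 0. 1. 0. 0. 1. 0. 0.]
```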
Outliers in categorical data
Like numerical data, categorical data also contains outliers. Suppose car_color contains not only the popular colors, but also some rarely used outlier colors, such as "Mauve" or "Avocado". Rather than giving each of these outlier colors a separate category, you can lump them into a single "catch-all" category called out-of-vocabulary (OOV). In other words, all the outlier colors are binned into a single outlier bucket. The system learns a single weight for that outlier bucket.
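A minimal sketch of OOV binning, assuming a hypothetical set of in-vocabulary colors:

```python
# Hypothetical in-vocabulary colors; everything else falls into one OOV bucket.
KNOWN_COLORS = {"Red", "Orange", "Blue", "Yellow", "Green", "Black", "Purple", "Brown"}

def to_category(color):
    """Map rare outlier colors to a single out-of-vocabulary (OOV) category."""
    return color if color in KNOWN_COLORS else "OOV"

print(to_category("Mauve"))    # OOV
print(to_category("Avocado"))  # OOV
print(to_category("Blue"))     # Blue
```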
Some categorical features have a high number of dimensions, such as those in the following table:
| Feature name | # of categories | Sample categories |
|---|---|---|
| words_in_english | ~500,000 | "happy", "walking" |
| US_postal_codes | ~42,000 | "02114", "90301" |
| last_names_in_Germany | ~850,000 | "Schmidt", "Schneider" |

When the number of categories is high, one-hot encoding is usually a bad choice. Embeddings, detailed in a separate Embeddings module, are usually a much better choice. Embeddings substantially reduce the number of dimensions (see the sketch following this list), which benefits models in two important ways:

- Models typically train faster.
- The built model typically infers predictions more quickly; that is, the model has lower latency.
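To illustrate the dimension reduction, here is a minimal Keras sketch. The vocabulary size matches the words_in_english row above, while the 64-dimension embedding size is an arbitrary assumption:

```python
import tensorflow as tf

VOCAB_SIZE = 500_000   # ~number of English words
EMBEDDING_DIM = 64     # assumption: far smaller than a 500,000-element one-hot vector

# Each index maps to a learned, dense 64-element vector instead of a
# 500,000-element one-hot vector.
embedding_layer = tf.keras.layers.Embedding(input_dim=VOCAB_SIZE,
                                            output_dim=EMBEDDING_DIM)

word_indices = tf.constant([17, 42])          # two hypothetical word indices
dense_vectors = embedding_layer(word_indices)
print(dense_vectors.shape)                    # (2, 64)
```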
Hashing (also called the hashing trick) is a less common way to reduce the number of dimensions.
In brief, hashing maps a category (for example, a color) to a small integer—the number of the "bucket" that will hold that category.
In detail, you implement a hashing algorithm as follows (a code sketch follows this list):

1. Set the number of buckets to a number far smaller than the number of categories.
2. Pass each category through a hash function, producing a large integer.
3. Take that integer modulo the number of buckets; the result is the category's bucket number.
4. Use the bucket number, rather than the original category, as the feature's index. (Multiple categories can collide into the same bucket.)
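Here is a minimal Python sketch of that procedure. The bucket count is an arbitrary assumption; hashlib is used because Python's built-in hash() is randomized per process:

```python
import hashlib

NUM_BUCKETS = 100  # assumption: far fewer buckets than categories

def hash_bucket(category):
    """Map a category string to one of NUM_BUCKETS buckets."""
    # A stable hash keeps the category-to-bucket mapping reproducible
    # across training and serving.
    digest = hashlib.md5(category.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_BUCKETS

print(hash_bucket("Schmidt"))    # some bucket number in [0, 99]
print(hash_bucket("Schneider"))  # possibly the same bucket (a collision)
```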
For more details about hashing data, see the Randomization section of the Production machine learning systems module.