Matrix factorization is a simple embedding model. Given the feedback matrix \(A \in \mathbb R^{m \times n}\), where \(m\) is the number of users (or queries) and \(n\) is the number of items, the model learns:
- A user embedding matrix \(U \in \mathbb R^{m \times d}\), where row \(i\) is the embedding for user \(i\).
- An item embedding matrix \(V \in \mathbb R^{n \times d}\), where row \(j\) is the embedding for item \(j\).
The embeddings are learned such that the product \(U V^T\) is a good approximation of the feedback matrix \(A\). Observe that the \((i, j)\) entry of \(U V^T\) is simply the dot product \(\langle U_i, V_j\rangle\) of the embeddings of user \(i\) and item \(j\), which you want to be close to \(A_{i, j}\).
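For instance, the following NumPy sketch (with arbitrary illustrative dimensions and random embeddings) checks that each entry of \(U V^T\) is the corresponding dot product:

```python
import numpy as np

# Illustrative dimensions and random embeddings (all values are arbitrary).
m, n, d = 4, 5, 2                      # users, items, embedding dimension
rng = np.random.default_rng(0)
U = rng.normal(size=(m, d))            # user embeddings, one row per user
V = rng.normal(size=(n, d))            # item embeddings, one row per item

# Entry (i, j) of U @ V.T is the dot product of user i's and item j's embeddings.
i, j = 1, 3
assert np.isclose((U @ V.T)[i, j], np.dot(U[i], V[j]))
```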
Note: Matrix factorization typically gives a more compact representation than learning the full matrix. The full matrix has \(O(nm)\) entries, while the embedding matrices \(U, V\) have \(O((n+m)d)\) entries, where the embedding dimension \(d\) is typically much smaller than \(m\) and \(n\). As a result, matrix factorization finds latent structure in the data, assuming that observations lie close to a low-dimensional subspace. In small examples, the values of \(n\), \(m\), and \(d\) are so low that the advantage is negligible. In real-world recommendation systems, however, matrix factorization can be significantly more compact than learning the full matrix.
Choosing the objective function
One intuitive objective function is the squared distance. To do this, minimize the sum of squared errors over all pairs of observed entries:
\[\min_{U \in \mathbb R^{m \times d},\ V \in \mathbb R^{n \times d}} \sum_{(i, j) \in \text{obs}} (A_{ij} - \langle U_{i}, V_{j} \rangle)^2.\]
In this objective function, you sum only over observed pairs \((i, j)\), that is, over non-zero values in the feedback matrix. However, summing only over the observed values is not a good idea: a model whose approximation \(U V^T\) is all ones would achieve minimal loss, yet it couldn't make effective recommendations and would generalize poorly.
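As a small sketch, assuming the observed feedback is stored as a list of \((i, j, A_{ij})\) triples (an illustrative representation, not one prescribed here), the observed-only loss can be computed as:

```python
import numpy as np

def observed_squared_error(observed, U, V):
    """Sum of squared errors over observed (i, j, a_ij) feedback triples.

    Only the observed pairs contribute; unobserved entries are ignored,
    which is exactly the weakness discussed above.
    """
    return sum((a_ij - np.dot(U[i], V[j])) ** 2 for i, j, a_ij in observed)
```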
Perhaps you could treat the unobserved values as zero, and sum over all entries in the matrix. This corresponds to minimizing the squared Frobenius distance between \(A\) and its approximation \(U V^T\):
\[\min_{U \in \mathbb R^{m \times d},\ V \in \mathbb R^{n \times d}} \|A - U V^T\|_F^2.\]
You can solve this quadratic problem through Singular Value Decomposition (SVD) of the matrix. However, SVD is not a great solution either, because in real applications, the matrix \(A\) may be very sparse. For example, think of all the videos on YouTube compared to all the videos a particular user has viewed. The solution \(UV^T\) (which corresponds to the model's approximation of the input matrix) will likely be close to zero, leading to poor generalization performance.
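As a sketch of this approach (assuming a dense matrix with zeros at the unobserved entries), a rank-\(d\) truncated SVD yields the minimizer of the Frobenius objective:

```python
import numpy as np

def svd_factorization(A, d):
    """Rank-d factorization A ~ U @ V.T via truncated SVD.

    Among all rank-d approximations, this minimizes the squared Frobenius
    distance to A, treating unobserved entries as zeros.
    """
    U_full, s, Vt_full = np.linalg.svd(A, full_matrices=False)
    U = U_full[:, :d] * np.sqrt(s[:d])      # user embeddings, shape (m, d)
    V = Vt_full[:d, :].T * np.sqrt(s[:d])   # item embeddings, shape (n, d)
    return U, V
```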
In contrast, Weighted Matrix Factorization decomposes the objective into the following two sums:
\[\min_{U \in \mathbb R^{m \times d},\ V \in \mathbb R^{n \times d}} \sum_{(i, j) \in \text{obs}} (A_{ij} - \langle U_{i}, V_{j} \rangle)^2 + w_0 \sum_{(i, j) \not \in \text{obs}} (\langle U_i, V_j\rangle)^2.\]
Here, \(w_0\) is a hyperparameter that weights the two terms so that the objective is not dominated by one or the other. Tuning this hyperparameter is very important.
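A minimal sketch of this weighted objective, assuming a dense feedback matrix and a boolean mask of observed entries (both illustrative choices), looks like this:

```python
import numpy as np

def weighted_mf_loss(A, observed_mask, U, V, w0):
    """Weighted matrix factorization objective.

    A is the dense feedback matrix, observed_mask is a boolean matrix marking
    the observed entries, and w0 weights the unobserved term.
    """
    pred = U @ V.T
    observed_term = np.sum(observed_mask * (A - pred) ** 2)
    unobserved_term = w0 * np.sum((~observed_mask) * pred ** 2)
    return observed_term + unobserved_term
```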
Note: In practical applications, you also need to weight the observed pairs carefully. For example, frequent items (for example, extremely popular YouTube videos) or frequent queries (for example, heavy users) may dominate the objective function. You can correct for this effect by weighting training examples to account for item frequency. In other words, you can replace the objective function by:
\[\sum_{(i, j) \in \text{obs}} w_{i, j} (A_{i, j} - \langle U_i, V_j \rangle)^2 + w_0 \sum_{(i, j) \not\in \text{obs}} \langle U_i, V_j \rangle^2\]
where \(w_{i, j}\) is a function of the frequency of query \(i\) and item \(j\).
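The note above only says that \(w_{i, j}\) should depend on the frequencies; one illustrative (assumed, not prescribed) choice is to down-weight pairs involving frequent users and items by their inverse observation counts:

```python
import numpy as np

def frequency_weights(observed_mask):
    """One possible choice of w_ij (an assumption, not the course's formula):
    the inverse of the product of the user's and item's observation counts."""
    user_freq = observed_mask.sum(axis=1, keepdims=True)  # observations per user
    item_freq = observed_mask.sum(axis=0, keepdims=True)  # observations per item
    return 1.0 / np.maximum(user_freq * item_freq, 1)     # w_ij, shape (m, n)
```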
Minimizing the objective function
Common algorithms to minimize the objective function include:
- Stochastic gradient descent (SGD) is a generic method to minimize loss functions.
- Weighted Alternating Least Squares (WALS) is specialized to this particular objective.
The objective is quadratic in each of the two matrices \(U\) and \(V\). (Note, however, that the problem is not jointly convex.) WALS works by initializing the embeddings randomly, then alternating between:
- Fixing \(U\) and solving for \(V\).
- Fixing \(V\) and solving for \(U\).
Each stage can be solved exactly (via solution of a linear system) and can be distributed. This technique is guaranteed to converge because each step is guaranteed to decrease the loss.
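A minimal dense NumPy sketch of WALS for the weighted objective above (the small ridge term `reg` and the dense representation are illustrative assumptions, not part of the course material):

```python
import numpy as np

def wals(A, observed_mask, d, w0=0.1, num_iters=20, reg=1e-3, seed=0):
    """Weighted Alternating Least Squares sketch.

    A: (m, n) feedback matrix with zeros at unobserved positions.
    observed_mask: boolean matrix marking observed entries.
    w0: weight on the unobserved term; reg: ridge term for numerical stability.
    """
    m, n = A.shape
    rng = np.random.default_rng(seed)
    U = rng.normal(scale=0.1, size=(m, d))
    V = rng.normal(scale=0.1, size=(n, d))
    W = np.where(observed_mask, 1.0, w0)  # per-entry weights
    for _ in range(num_iters):
        # Fix V, solve a small linear system for each user row U_i.
        for i in range(m):
            G = V.T @ (W[i, :, None] * V) + reg * np.eye(d)
            b = V.T @ (W[i] * A[i])
            U[i] = np.linalg.solve(G, b)
        # Fix U, solve a small linear system for each item row V_j.
        for j in range(n):
            G = U.T @ (W[:, j, None] * U) + reg * np.eye(d)
            b = U.T @ (W[:, j] * A[:, j])
            V[j] = np.linalg.solve(G, b)
    return U, V
```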
SGD versus WALS
SGD and WALS have advantages and disadvantages. Review the information below to see how they compare:
SGD:
- Very flexible—can use other loss functions.
- Can be parallelized.
- Slower—does not converge as quickly.
- Harder to handle the unobserved entries (need to use negative sampling or gravity; see the sketch after this comparison).

WALS:
- Limited to the squared loss only.
- Can be parallelized.
- Converges faster than SGD.
- Easier to handle unobserved entries.
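For completeness, here is a minimal sketch of one SGD pass with simple negative sampling for the unobserved entries (the sampling scheme, learning rate, and weight below are illustrative assumptions):

```python
import numpy as np

def sgd_mf_pass(observed, U, V, n_items, lr=0.05, w0=0.1, num_neg=1, rng=None):
    """One SGD pass over observed (i, j, a_ij) triples with negative sampling.

    Unobserved entries are handled by drawing num_neg random items per
    observation and pushing their predicted scores toward zero with weight w0.
    """
    rng = rng or np.random.default_rng(0)
    for i, j, a_ij in observed:
        # Gradient step on the observed squared-error term.
        err = a_ij - U[i] @ V[j]
        U[i], V[j] = U[i] + lr * err * V[j], V[j] + lr * err * U[i]
        # Gradient step on sampled "unobserved" entries (target score 0).
        for k in rng.integers(0, n_items, size=num_neg):
            err_neg = 0.0 - U[i] @ V[k]
            U[i], V[k] = U[i] + lr * w0 * err_neg * V[k], V[k] + lr * w0 * err_neg * U[i]
    return U, V
```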