Latent Semantic Analysis (LSA) is a mathematical method that tries to bring out latent relationships within a collection of documents. Rather than looking at each document in isolation, it examines the collection as a whole, and the terms within it, to identify relationships.
An example of LSA: using a search engine, search for “sand”.
Documents are returned which do not contain the search term “sand” but do contain terms like “beach”.
LSA has identified a latent relationship: “sand” is semantically close to “beach”.
There are some very good papers describing LSA in detail:
An introduction to LSA: http://lsa.colorado.edu/papers/dp1.LSAintro.pdf
Creating your own LSA space: http://www.andrew.cmu.edu/user/jquesada/pdf/bookSpacesRev1.pdf
Latent Semantic Analysis: http://en.wikipedia.org/wiki/Latent_semantic_indexing
This is an implementation of LSA in Python (2.4+). Thanks to scipy it's rather simple!
1 Create the term-document matrix
We use the previous work in Vector Space Search to build this matrix.
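As a rough, self-contained sketch of the idea (the naive tokeniser below is an assumption of mine; the Vector Space Search code also stems terms and strips stop words, which is why the matrices later in this post show terms like “poni” and “disabl”):

documents = ["The cat in the hat disabled",
             "A cat is a fine pet ponies.",
             "Dogs and cats make good pets",
             "I haven't got a hat."]

def buildTermDocumentMatrix(documents):
    #Naive tokenisation: lowercase, drop full stops, split on whitespace
    tokenised = [doc.lower().replace(".", "").split() for doc in documents]
    vocabulary = sorted(set(word for doc in tokenised for word in doc))
    #matrix[x][y] = raw count of vocabulary[y] in document x
    matrix = [[doc.count(word) for word in vocabulary] for doc in tokenised]
    return vocabulary, matrix

vocabulary, matrix = buildTermDocumentMatrix(documents)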
2 tf-idf Transform
Apply the tf-idf transform to the term-document matrix. This generally helps improve results with LSA.
def tfidfTransform(self,):
    """ Apply TermFrequency(tf)*inverseDocumentFrequency(idf) for each matrix element.
    This evaluates how important a word is to a document in a corpus

    With a document-term matrix: matrix[x][y]
        tf[x][y] = frequency of term y in document x / frequency of all terms in document x
        idf[x][y] = log( abs(total number of documents in corpus) / abs(number of documents with term y) )

    Note: This is not the only way to calculate tf*idf
    """
    documentTotal = len(self.matrix)
    rows, cols = self.matrix.shape

    for row in xrange(0, rows):  #For each document
        wordTotal = reduce(lambda x, y: x + y, self.matrix[row])

        for col in xrange(0, cols):  #For each term
            #For consistency ensure all self.matrix values are floats
            self.matrix[row][col] = float(self.matrix[row][col])

            if self.matrix[row][col] != 0:
                termDocumentOccurences = self.__getTermDocumentOccurences(col)

                termFrequency = self.matrix[row][col] / float(wordTotal)
                inverseDocumentFrequency = log(abs(documentTotal / float(termDocumentOccurences)))
                self.matrix[row][col] = termFrequency * inverseDocumentFrequency
3 Singular Value Decomposition
SVD: http://en.wikipedia.org/wiki/Singular_value_decomposition
Determine U, Sigma and VT from the MATRIX built in the previous steps.
U . SIGMA . VT = MATRIX
We use the scipy SVD implementation here. Note that numpy (version 1.0.4) also has an implementation of SVD, but I had lots of problems with it; I found it did not work with anything larger than a 2-dimensional matrix.
#Uses scipy's linalg (from scipy import linalg)
#Sigma comes out as a 1-D array of singular values rather than a diagonal matrix
u, sigma, vt = linalg.svd(self.matrix)
4 Reduce the dimensions of Sigma
We generally delete the smallest coefficients in the diagonal matrix Sigma to produce Sigma'. Reducing the dimensions of Sigma combines some dimensions so that they depend on more than one term. The number of coefficients deleted depends on the corpus used: the dimensionality that remains should be large enough to fit the real structure in the data, but small enough that noise or unimportant details are not modelled.
The real difficulty and weakness of LSA is knowing how many dimensions to remove. There is no exact method for finding the right number of dimensions. Generally the L2 norm or the Frobenius norm is used.
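A sketch of the reduction (the helper name is mine): scipy returns the singular values sorted from largest to smallest, so removing the smallest coefficients amounts to zeroing the tail of the sigma array. Zeroing rather than physically deleting the entries keeps the shapes compatible with the reconstruction in the next step.

def reduceSigma(sigma, dimensionsToRemove):
    #sigma is the 1-D array of singular values from linalg.svd,
    #sorted largest to smallest, so the smallest coefficients are at the end
    sigma = sigma.copy()
    if dimensionsToRemove > 0:
        sigma[-dimensionsToRemove:] = 0.0
    return sigma

#e.g. zero the 2 smallest coefficients, as in the worked example below
sigma = reduceSigma(sigma, 2)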
5 Calculate the Product with the New Sigma'
Finally we calculate:
U . SIGMA' . VT = MATRIX'
#Reconstruct MATRIX' (diagsvd rebuilds the diagonal Sigma matrix at the right shape)
reconstructedMatrix = dot(dot(u, linalg.diagsvd(sigma, len(self.matrix), len(vt))), vt)
This gives us our final LSA matrix. We can now apply the same functionality used in Vector Space Search: searching and finding relatedness scores for documents.
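As a hedged sketch of what that searching can look like against MATRIX' (the helper names and the simple binary query vector are illustrative, not taken from the Vector Space Search code): build a query vector over the same vocabulary, then rank documents by cosine similarity against the matrix rows.

from numpy import dot
from numpy.linalg import norm

def cosine(a, b):
    #Cosine similarity between two vectors; 0.0 when either vector is all zeros
    denominator = norm(a) * norm(b)
    return float(dot(a, b) / denominator) if denominator else 0.0

def search(queryTerms, vocabulary, lsaMatrix):
    #Query vector in the same term order as the matrix columns
    queryVector = [1.0 if term in queryTerms else 0.0 for term in vocabulary]
    #One relatedness score per document (row) in the LSA matrix
    return [cosine(queryVector, row) for row in lsaMatrix]

#e.g. search(["cat"], vocabulary, reconstructedMatrix)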
LSA In Action – Matrices
We start with our document-term frequency matrix, which is generated by creating a Vector Space Search with four documents (D1-D4):
D1: “The cat in the hat disabled”
D2: “A cat is a fine pet ponies.”
D3: “Dogs and cats make good pets”
D4: “I haven’t got a hat.”
good pet hat make dog cat poni fine disabl
D1 [+0.00 +0.00 +1.00 +0.00 +0.00 +1.00 +0.00 +0.00 +1.00 ]
D2 [+0.00 +1.00 +0.00 +0.00 +0.00 +1.00 +1.00 +1.00 +0.00 ]
D3 [+1.00 +1.00 +0.00 +1.00 +1.00 +1.00 +0.00 +0.00 +0.00 ]
D4 [+0.00 +0.00 +1.00 +0.00 +0.00 +0.00 +0.00 +0.00 +0.00 ]
Apply tf-idf transform:
D1 [+0.00 +0.00 +0.23 +0.00 +0.00 +0.10 +0.00 +0.00 +0.46 ]
D2 [+0.00 +0.17 +0.00 +0.00 +0.00 +0.07 +0.35 +0.35 +0.00 ]
D3 [+0.28 +0.14 +0.00 +0.28 +0.28 +0.06 +0.00 +0.00 +0.00 ]
D4 [+0.00 +0.00 +0.69 +0.00 +0.00 +0.00 +0.00 +0.00 +0.00 ]
Perform SVD and reduce Sigma's dimensions (removing the 2 smallest coefficients):
D1 [+0.01 +0.01 +0.34 +0.01 +0.01 +0.03 +0.02 +0.02 +0.11 ]
D2 [-0.00 +0.17 -0.01 -0.00 -0.00 +0.08 +0.35 +0.35 +0.02 ]
D3 [+0.28 +0.14 -0.01 +0.28 +0.28 +0.06 -0.00 -0.00 +0.02 ]
D4 [-0.01 -0.01 +0.63 -0.01 -0.01 +0.04 -0.01 -0.01 +0.19 ]
Note that the word ‘disabl’, despite not appearing in D4, now has a weighting in that document.
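Tying the steps together, here is a compact end-to-end sketch over the same four documents. The simple tokeniser (no stemming or stop-word removal) is an assumption, so its vocabulary and values will differ from the tables above, but the shape of the computation is the same.

import numpy as np
from scipy import linalg

documents = ["The cat in the hat disabled",
             "A cat is a fine pet ponies.",
             "Dogs and cats make good pets",
             "I haven't got a hat."]

#1. Term-document matrix (naive tokenisation)
tokenised = [d.lower().replace(".", "").split() for d in documents]
vocabulary = sorted(set(w for doc in tokenised for w in doc))
matrix = np.array([[doc.count(w) for w in vocabulary] for doc in tokenised], dtype=float)

#2. tf-idf transform
documentTotal = len(matrix)
for row in range(matrix.shape[0]):
    wordTotal = matrix[row].sum()
    for col in range(matrix.shape[1]):
        if matrix[row][col] != 0:
            termDocumentOccurences = np.count_nonzero(matrix[:, col])
            tf = matrix[row][col] / wordTotal
            idf = np.log(documentTotal / float(termDocumentOccurences))
            matrix[row][col] = tf * idf

#3-5. SVD, zero the 2 smallest coefficients of Sigma, reconstruct
u, sigma, vt = linalg.svd(matrix)
sigma[-2:] = 0.0
lsaMatrix = np.dot(np.dot(u, linalg.diagsvd(sigma, len(matrix), len(vt))), vt)
print(np.round(lsaMatrix, 2))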
Dependencies
scipy (used here for the SVD)

Problems
LSA assumes that term frequencies follow a Normal distribution, whereas a Poisson distribution has actually been observed.
Source Code
Available at github:
git clone git://github.com/josephwilk/semanticpy.git