A vector space search involves converting documents into vectors, where each dimension represents a term. If a document contains a term, the value for that dimension is greater than zero.
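As a toy illustration (the three-term vocabulary here is made up purely for this example), counting term occurrences per dimension gives:

vocabulary = ["cat", "dog", "sat"]           # one vector dimension per term
document = "the cat sat by the cat".split()
vector = [document.count(term) for term in vocabulary]
print(vector)  # [2, 0, 1] -- "cat" twice, "dog" never, "sat" once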
Here is an implementation of vector space searching in Python (2.4+).
1 Stemming & Stop Words
Fetch all terms within the documents and clean them, using a stemmer to reduce each word. A stemmer takes a word and tries to reduce it to its base or root; words which share a common stem often have similar meanings. For example, CONNECTED, CONNECTING, CONNECTION and CONNECTIONS all map to CONNECT.
We also remove any stopwords from the documents. [a,am,an,also,any,and] are all examples of stopwords in English. Stop words have little value in search so we strip them. The stoplist used was taken from: ftp://ftp.cs.cornell.edu/pub/smart/english.stop
self.stemmer = PorterStemmer()  # set up in the parser's constructor

def removeStopWords(self, wordList):
    """ Remove common words which have no search value """
    return [word for word in wordList if word not in self.stopwords]

def tokenise(self, string):
    """ Break a string up into tokens and stem each word """
    string = self.clean(string)
    words = string.split(" ")
    return [self.stemmer.stem(word, 0, len(word) - 1) for word in words]
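A rough sketch of using these methods together (assuming they live on a Parser class that also provides clean and loads the stoplist, as in the linked source):

parser = Parser()
words = parser.tokenise("Connected connections connecting")
words = parser.removeStopWords(words)
# The Porter stemmer reduces all three forms to the single stem "connect".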
2 Map Keywords to Vector Dimensions
Map each keyword to a vector dimension using a dictionary of keyword => position.
def getVectorKeywordIndex(self, documentList):
    """ Create the mapping from keyword to position within the document vectors """
    # Map all documents into a single word string
    vocabularyString = " ".join(documentList)
    vocabularyList = self.parser.tokenise(vocabularyString)
    # Remove common words which have no search value
    vocabularyList = self.parser.removeStopWords(vocabularyList)
    uniqueVocabularyList = util.removeDuplicates(vocabularyList)

    vectorIndex = {}
    offset = 0
    # Associate a position with each keyword; the position is the dimension
    # used to represent that word in the document vectors
    for word in uniqueVocabularyList:
        vectorIndex[word] = offset
        offset += 1
    return vectorIndex  # (keyword: position)
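For example, given two short documents (engine here stands for whatever object holds these methods; the exact positions depend on the order removeDuplicates preserves):

documentList = ["The cat sat on the mat", "The dog sat"]
vectorKeywordIndex = engine.getVectorKeywordIndex(documentList)
# A possible result: {'cat': 0, 'sat': 1, 'mat': 2, 'dog': 3}
# Stopwords such as "the" and "on" never receive a dimension.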
3 Map Document Strings to Vectors
We use the simple Term Count Model. A more accurate model would be tf-idf (term frequency-inverse document frequency).
def makeVector(self, wordString):
    """ @pre: unique(vectorIndex) """
    # Initialise vector with 0's
    vector = [0] * len(self.vectorKeywordIndex)
    wordList = self.parser.tokenise(wordString)
    wordList = self.parser.removeStopWords(wordList)
    for word in wordList:
        vector[self.vectorKeywordIndex[word]] += 1  # simple Term Count Model
    return vector
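Continuing the example above (same hypothetical engine and keyword index):

vector = engine.makeVector("The dog sat on the mat")
# With vectorKeywordIndex {'cat': 0, 'sat': 1, 'mat': 2, 'dog': 3}
# this yields [0, 1, 1, 1]: no "cat", one "sat", one "mat", one "dog".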
4 Find Related Documents
We now have the ability to find related documents. We can test if two documents are close in the concept space by looking at the cosine of the angle between their document vectors. We use the cosine of the angle as a metric for comparison. If the cosine is 1 then the angle is 0° and hence the vectors are parallel (and the document terms are related). If the cosine is 0 then the angle is 90° and the vectors are perpendicular (and the document terms are not related).
We calculate the cosine between the document vectors in Python using NumPy's dot and norm functions.
from numpy import dot
from numpy.linalg import norm

def cosine(vector1, vector2):
    """ Related documents j and q sit close together in the concept space:
        cosine = (V1 . V2) / (||V1|| * ||V2||) """
    return float(dot(vector1, vector2) / (norm(vector1) * norm(vector2)))
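A quick numeric check of the measure:

print(cosine([1, 1, 0], [1, 1, 0]))  # 1.0 -- same direction, fully related
print(cosine([1, 1, 0], [0, 0, 1]))  # 0.0 -- orthogonal, nothing in common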
5 Search the Vector Space!
In order to perform a search across keywords we need to map the keywords into the vector space. We create a temporary document which represents the search terms, then compare it against the document vectors using the same cosine measure used for relatedness.
def search(self, searchList):
    """ Search for documents that match based on a list of terms """
    queryVector = self.buildQueryVector(searchList)
    ratings = [util.cosine(queryVector, documentVector)
               for documentVector in self.documentVectors]
    ratings.sort(reverse=True)  # most related documents first
    return ratings
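buildQueryVector is not shown in the snippet above; a minimal sketch, assuming it simply joins the query terms back into a string and reuses makeVector so queries and documents share the same keyword space:

def buildQueryVector(self, termList):
    """ Map the search terms into the document vector space """
    query = self.makeVector(" ".join(termList))
    return query

Note that search returns only the sorted scores; to know which document each score belongs to, keep the (rating, document) pairs together before sorting.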
Further Extensions
Use tf-idf rather than the Term Count Model for term weightings (a sketch follows this list).
Instead of linearly scanning all document vectors when searching for related content, use Lanczos methods or a neural-network-like approach.
Move towards Latent Semantic Analysis, Probabilistic Latent Semantic Analysis or Latent Dirichlet Allocation.
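As flagged in the first extension above, here is a minimal sketch of tf-idf weighting as a drop-in replacement for the raw counts (the function name and shape are illustrative, not taken from the linked source):

import math

def tfidfVectors(countVectors):
    """ Reweight raw term-count document vectors with tf-idf """
    numDocs = len(countVectors)
    numTerms = len(countVectors[0])
    # Document frequency: in how many documents does each term appear?
    # df[t] >= 1 because the vocabulary is built from these documents.
    df = [sum(1 for v in countVectors if v[t] > 0) for t in range(numTerms)]
    weighted = []
    for vec in countVectors:
        total = sum(vec) or 1  # guard against empty documents
        weighted.append([(count / float(total)) * math.log(numDocs / float(df[t]))
                         for t, count in enumerate(vec)])
    return weighted

Terms that appear in every document get a weight of zero (log 1 = 0), while rare terms are boosted.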
The stemmer used comes from: http://tartarus.org/~martin/PorterStemmer/python.txt
And the library for performing cosine calculations comes from NumPy: http://www.scipy.org/
Source: https://github.com/josephwilk/semanticpy.git