It is common for users to build RDF applications that combine some form of "keyword search" with their queries. For example, a user might want to retrieve all triples for which the string "Alice" appears as a word within the third (object) field of the triple. AllegroGraph supports free-text matching both within a SPARQL query and through the evalFreeTextSearch() method of the connection object. Either way, you must first create and configure indexes appropriate to the searches you want to run.
First let's open a connection:
    from franz.openrdf.connect import ag_connect
    conn = ag_connect("python-tutorial", create=True, clear=True)
We will start this example by importing some sample data:
    conn.addData("""
        @prefix : <ex://> .
        :alice a :Person ;
               :fullname "Alice B. Toklas" .
        :book1 a :Book ;
               :title "Alice in Wonderland" ;
               :author :carroll .
        :carroll a :Person ;
                 :fullname "Lewis Carroll" .""")
We have to create an index. AllegroGraph lets you create any number of text indexes, each for a specific purpose. In this case we are indexing the literal values we find in the fullname predicate, which we have used in resources that describe people. The createFreeTextIndex() method has many configurable parameters. Their default settings are appropriate to this situation. All we have to provide is a name for the index and the URI of the predicate (or predicates) that contain the text to be indexed.
    fullname = conn.createURI(namespace='ex://', localname='fullname')
    conn.createFreeTextIndex("index1", predicates=[fullname])
We can view the index configuration using the getFreeTextIndexConfiguration() method:
    config = conn.getFreeTextIndexConfiguration("index1")
    for key, value in config.items():
        if isinstance(value, list):
            value = ', '.join(str(x) for x in value)
        print('{key}: {value}'.format(key=key, value=value))
The output looks like this:
    tokenizer: default
    graphs:
    types:
    indexLiterals: True
    minimumWordSize: 3
    indexFields: object
    stopWords: ...
    innerChars:
    predicates: <ex://fullname>
    wordFilters:
    indexResources: False
    borderChars:
This configuration says that index1 will operate on the literal values it finds in the object position of the <ex://fullname> predicate. It ignores words shorter than three characters. It will ignore the words in its stopWords list (elided from sample output). If it encounters a resource URI in the object position, it will ignore it. This index doesn't use any wordFilters, which are sometimes used to remove accented letters and to perform stemming on indexed text and search strings.
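To make the effect of minimumWordSize and stopWords more concrete, here is a minimal pure-Python sketch of that kind of word filtering. It is only an illustration, not AllegroGraph's tokenizer: the index_terms helper, the lowercasing step, and the three stop words are assumptions made for this example (the real stopWords list is elided above).

```python
import re

# Hypothetical stop-word list for illustration only; the real stopWords
# list is elided in the sample output and is not reproduced here.
STOP_WORDS = {"and", "the", "for"}
MIN_WORD_SIZE = 3  # mirrors minimumWordSize in the configuration shown above


def index_terms(literal, min_size=MIN_WORD_SIZE, stop_words=STOP_WORDS):
    """Return the words of `literal` that survive the size and stop-word filters."""
    # Lowercasing and splitting on non-alphanumerics are assumptions of this sketch.
    words = re.findall(r"[A-Za-z0-9]+", literal.lower())
    return [w for w in words if len(w) >= min_size and w not in stop_words]


print(index_terms("Alice B. Toklas"))
```

In this sketch the middle initial "B" is dropped because it is shorter than three characters, which is the behavior the configuration describes for short words.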
The text match occurs through a "magic" predicate called fti:match. This predicate has two arguments. One is the subject URI of the resources to search. The other is the string pattern to search for, such as "Alice". Only full-word matches will be found.
    query = conn.prepareTupleQuery(query="""
        SELECT ?s WHERE {
          ?s fti:match "Alice" .
        }""")
    query.evaluate(output=True)
There is no need to include a prefix declaration for the fti namespace. That is because fti is included among the built-in namespace mappings in AllegroGraph.
When we execute our SPARQL query, it matches the "Alice" within the literal "Alice B. Toklas" because that literal occurs in a triple having the fullname predicate, but it does not match the "Alice" in the literal "Alice in Wonderland" because the title predicate was not included in our index.
    --------------
    | s          |
    ==============
    | ex://alice |
    --------------
By default fti:match searches in all text indexes. It is possible to specify a single index name when searching. We'll illustrate this by creating another index, this time on the title predicate:
    title = conn.createURI(namespace='ex://', localname='title')
    conn.createFreeTextIndex("index2", predicates=[title])
    query = conn.prepareTupleQuery(query="""
        SELECT ?s WHERE {
          ?s fti:match ( "Alice" "index2" ) .
        }""")
    query.evaluate(output=True)
This time only the book title will match our query:
    --------------
    | s          |
    ==============
    | ex://book1 |
    --------------
Another way of searching text indexes is the evalFreeTextSearch() method:
    for triple in conn.evalFreeTextSearch("Alice", index="index1"):
        print(triple[0])
This works just like our first query. Note that evalFreeTextSearch() returns a list of lists of strings (in N-Triples format), not a list of Statement objects.
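Because the results are plain N-Triples strings, a little post-processing may be needed before using them. The sketch below shows one way to strip the delimiters from simple terms. The rows list is a hand-written sample in the shape described above, not captured server output, and strip_term is a hypothetical helper that does not handle escapes, datatypes, or language tags.

```python
# Hand-written sample in the shape described above (a list of lists of
# N-Triples strings); it is NOT captured server output.
rows = [
    ["<ex://alice>", "<ex://fullname>", '"Alice B. Toklas"'],
]


def strip_term(term):
    """Remove N-Triples delimiters from a simple URI or plain literal."""
    if term.startswith("<") and term.endswith(">"):
        return term[1:-1]   # URI reference
    if len(term) >= 2 and term.startswith('"') and term.endswith('"'):
        return term[1:-1]   # plain literal without datatype or language tag
    return term             # anything else is returned unchanged


for row in rows:
    # Index by position rather than unpacking, in case a row has extra elements.
    print(strip_term(row[0]), strip_term(row[1]), strip_term(row[2]))
```

For anything beyond simple terms like these, a real N-Triples parser is a safer choice than string slicing.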
The text index supports simple wildcard queries. The asterisk (*) may be appended to the end of the pattern to indicate "any number of additional characters." For instance, this query looks for whole words that begin with "Ali":
    for triple in conn.evalFreeTextSearch("Ali*"):
        print(triple[0])
This search runs across both indexes, so it will find both the :title and the :fullname triples.
    <ex://alice>
    <ex://book1>
There is also a single-character wildcard, the question mark. It will match any single character. You can add as many question marks as you need to the string pattern. This query looks for a word of five or more letters that has "l" in the second position and "c" in the fourth position (the trailing asterisk allows additional characters after the fifth):
    for triple in conn.evalFreeTextSearch("?l?c?*"):
        print(triple[0])
The result is the same as for the previous query:
    <ex://alice>
    <ex://book1>
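The wildcard rules described above can be mimicked in plain Python by translating a pattern into an anchored regular expression. This is only a sketch of the matching semantics as the text describes them, not how AllegroGraph implements its index lookup. The helper names are hypothetical, and case-insensitive matching is an assumption of the sketch.

```python
import re


def wildcard_to_regex(pattern):
    """Translate a pattern using * and ? into a compiled regular expression."""
    parts = []
    for ch in pattern:
        if ch == "*":
            parts.append(".*")           # any number of additional characters
        elif ch == "?":
            parts.append(".")            # exactly one character
        else:
            parts.append(re.escape(ch))  # everything else matches literally
    # Case-insensitive matching is an assumption of this sketch.
    return re.compile("".join(parts), re.IGNORECASE)


def word_matches(pattern, word):
    """Return True if the whole word matches the wildcard pattern."""
    return wildcard_to_regex(pattern).fullmatch(word) is not None


words = ["Alice", "Toklas", "Wonderland", "Carroll"]
for pattern in ["Ali*", "?l?c?*", "Ali"]:
    print(pattern, [w for w in words if word_matches(pattern, w)])
```

Note that the bare pattern "Ali" matches none of the sample words here, which mirrors the whole-word behavior of the search: a prefix only matches when the asterisk is present.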
Text indexes are not the only way of matching text values available in SPARQL. One may also filter results using regular expressions. This approach is more flexible, but at the price of performance. Regular expression filters do not use any form of indexing to speed up the query.
    query = conn.prepareTupleQuery(query="""
        SELECT ?s ?p ?o WHERE {
          ?s ?p ?o .
          FILTER regex(?o, "lic|oll")
        }""")
    query.evaluate(output=True)
Note how this search matches the provided pattern inside words.
    ------------------------------------------------------
    | s            | p             | o                   |
    ======================================================
    | ex://carroll | ex://fullname | Lewis Carroll       |
    | ex://book1   | ex://title    | Alice in Wonderland |
    | ex://alice   | ex://fullname | Alice B. Toklas     |
    ------------------------------------------------------
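The substring behavior of this filter can be reproduced locally with Python's re module. The snippet below is just an illustration: SPARQL's regex() function follows XPath regular expression rules rather than Python's, but for a simple alternation like this one the two should agree.

```python
import re

# The three literals from the sample data used in this tutorial.
literals = ["Lewis Carroll", "Alice in Wonderland", "Alice B. Toklas"]
pattern = re.compile("lic|oll")

for text in literals:
    match = pattern.search(text)  # search() finds the pattern anywhere in the string
    print(text, "->", match.group(0) if match else None)
```

Each literal matches on a fragment in the middle of a word ("oll" in "Carroll", "lic" in "Alice"), which a whole-word text index search would not find.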
In addition to indexing literal values, AllegroGraph can also index resource URIs. index3 is an index that looks for URIs in the object position of the author predicate, and then indexes only the local name of the resource (the characters following the rightmost /, # or : in the URI). This lets us avoid indexing highly repetitive namespace strings, which would fill the index with data that would not be very useful.
    author = conn.createURI(namespace='ex://', localname='author')
    conn.createFreeTextIndex("index3", predicates=[author],
                             indexResources="short", indexFields=["object"])
    for triple in conn.evalFreeTextSearch("carroll", index="index3"):
        print(triple[0])
The text search locates the triple that has carroll in the URI in the object position. In the sample data that is the :author triple of :book1, so the loop prints that triple's subject.
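The local-name rule used by index3 (the characters following the rightmost /, # or :) is easy to sketch in plain Python. The local_name function below is a hypothetical helper written for this illustration, not part of the AllegroGraph client, and the second URI is a made-up example.

```python
import re


def local_name(uri):
    """Return the characters after the rightmost '/', '#' or ':' in `uri`."""
    # Splitting on any of the three delimiters and taking the last piece
    # is equivalent to cutting at the rightmost delimiter.
    return re.split(r"[/#:]", uri)[-1]


print(local_name("ex://carroll"))
print(local_name("http://example.org/people#alice"))  # made-up URI for illustration
```

With this rule, the namespace part of each URI never reaches the index, so a search for "carroll" only has to match the short local name.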