Using decompounds can lead to undesired results in phrase queries. After indexing, decompound tokens cannot be distinguished from original tokens. A phrase query for "Deutsche Bank" could therefore match "Deutsche Spielbankgesellschaft", which is clearly an unexpected result. To enable "exact" phrase queries, each decompound token is tagged with additional payload data.
To evaluate this payload data, you can use `exact_phrase` as a wrapper around a query containing your phrase queries.
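
To see why the payload tag matters, here is a toy model of positional phrase matching in Python. It is only a sketch and not the plugin's implementation: the `Token` type, the `analyze` and `phrase_matches` helpers, and the hard-coded split table are all hypothetical, and the `is_decompound` flag merely stands in for the payload data.

```python
from typing import NamedTuple


class Token(NamedTuple):
    term: str
    position: int
    is_decompound: bool  # stands in for the payload tag (assumption)


def analyze(words, splits):
    """Emit each word plus its decompound parts at the same position."""
    tokens = []
    for pos, word in enumerate(words):
        lowered = word.lower()
        tokens.append(Token(lowered, pos, False))
        for part in splits.get(lowered, []):
            tokens.append(Token(part, pos, True))
    return tokens


def phrase_matches(tokens, phrase, exact=False):
    """True if the phrase terms occur at consecutive positions.

    With exact=True, tokens tagged as decompounds are ignored.
    """
    terms = [t.lower() for t in phrase.split()]
    positions = {}
    for tok in tokens:
        if exact and tok.is_decompound:
            continue
        positions.setdefault(tok.term, set()).add(tok.position)
    return any(
        all(start + i in positions.get(term, set()) for i, term in enumerate(terms))
        for start in positions.get(terms[0], set())
    )


# Hypothetical split table; the real decompounder derives splits itself.
SPLITS = {"spielbankgesellschaft": ["spiel", "bank", "gesellschaft"]}
doc = analyze(["Deutsche", "Spielbankgesellschaft"], SPLITS)

print(phrase_matches(doc, "Deutsche Bank"))              # True
print(phrase_matches(doc, "Deutsche Bank", exact=True))  # False
```

Without the tag, `bank` sits at the same position as `spielbankgesellschaft`, so the phrase matches. Once decompound tokens can be told apart, the match is rejected.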
`use_payload` - if set to `true`, enables payload creation. Default: `false`
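
The following Python sketch builds request bodies that combine the two pieces: a `decompound` filter with `use_payload` enabled, and a search wrapped in `exact_phrase`. The filter name `decomp`, the analyzer name `german_decomp`, the tokenizer/filter chain, the field name `text`, and above all the inner shape of the `exact_phrase` body (a nested `query` key) are assumptions for illustration. Check them against the plugin version you run.

```python
import json

# Index settings: a decompound token filter with payload creation enabled.
# "decomp" and "german_decomp" are hypothetical names.
settings = {
    "index": {
        "analysis": {
            "filter": {
                "decomp": {"type": "decompound", "use_payload": True}
            },
            "analyzer": {
                "german_decomp": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": ["lowercase", "decomp"],
                }
            },
        }
    }
}

# ASSUMPTION: exact_phrase takes the wrapped query under a "query" key.
search_body = {
    "query": {
        "exact_phrase": {
            "query": {
                "query_string": {
                    "query": "\"deutsche bank\"",
                    "fields": ["text"],
                }
            }
        }
    }
}

# Print the JSON so it can be pasted into a curl -d '...' call.
print(json.dumps(settings, indent=2))
print(json.dumps(search_body, indent=2))
```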
# Langdetect
curl -XDELETE 'localhost:9200/test'
curl -XPUT 'localhost:9200/test'
curl -XPOST 'localhost:9200/test/article/_mapping' -d '
{
"article" : {
"properties" : {
"content" : { "type" : "langdetect" }
}
}
}
'
curl -XPUT 'localhost:9200/test/article/1' -d '
{
"title" : "Some title",
"content" : "Oh, say can you see by the dawn`s early light, What so proudly we hailed at the twilight`s last gleaming?"
}
'
curl -XPUT 'localhost:9200/test/article/2' -d '
{
"title" : "Ein Titel",
"content" : "Einigkeit und Recht und Freiheit für das deutsche Vaterland!"
}
'
curl -XPUT 'localhost:9200/test/article/3' -d '
{
"title" : "Un titre",
"content" : "Allons enfants de la Patrie, Le jour de gloire est arrivé!"
}
'
curl -XGET 'localhost:9200/test/_refresh'
curl -XPOST 'localhost:9200/test/_search' -d '
{
"query" : {
"term" : {
"content" : "en"
}
}
}
'
curl -XPOST 'localhost:9200/test/_search' -d '
{
"query" : {
"term" : {
"content" : "de"
}
}
}
'
curl -XPOST 'localhost:9200/test/_search' -d '
{
"query" : {
"term" : {
"content" : "fr"
}
}
}
'
# Standardnumber
Try it out
----
GET _analyze
{
"tokenizer": "standard",
"filter": [
{
"type": "standardnumber"
}
],
"text": "Die ISBN von Elasticsearch in Action lautet 9781617291623"
}
----
{
"index" : {
"analysis" : {
"filter" : {
"standardnumber" : {
"type" : "standardnumber"
}
},
"analyzer" : {
"standardnumber" : {
"tokenizer" : "whitespace",
"filter" : [ "standardnumber", "unique" ]
}
}
}
}
}
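
As an illustration of what recognizing a standard number can involve, here is a small Python sketch that validates the ISBN-13 check digit of the number from the `_analyze` example above. It is not the plugin's code, and the helper name `isbn13_is_valid` is hypothetical. It only shows the public ISBN-13 checksum rule (alternating weights 1 and 3, sum divisible by 10).

```python
def isbn13_is_valid(candidate: str) -> bool:
    """Check the ISBN-13 check digit (alternating weights 1 and 3)."""
    compact = candidate.replace("-", "").replace(" ", "")
    if len(compact) != 13 or not (compact.isascii() and compact.isdigit()):
        return False
    total = sum(int(d) * (1 if i % 2 == 0 else 3) for i, d in enumerate(compact))
    return total % 10 == 0


text = "Die ISBN von Elasticsearch in Action lautet 9781617291623"
# Whitespace tokenization, then keep only tokens that pass the checksum.
found = [tok for tok in text.split() if isbn13_is_valid(tok)]
print(found)  # ['9781617291623']
```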
- WordDelimiterFilter2: taken from Lucene
- baseform: index also base forms of words (German, English)
- decompound: decompose words if possible (German)
- langdetect: find language code of detected languages
- standardnumber: standard number entity recognition
- hyphen: token filter for shingling and combining hyphenated words (German: Bindestrichwörter), the opposite of the decompound token filter
- sortform: process string forms for bibliographical sorting, taking non-sort areas into account
- year: token filter for 4-digit sequences
- reference:
## Crypt mapper
{
"someType" : {
"_source" : {
"enabled": false
},
"properties" : {
"someField":{ "type" : "crypt", "algo": "SHA-512" }
}
}
}
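
The mapping above names `SHA-512` as the algorithm for `someField`. For readers unfamiliar with it, this Python sketch shows what a SHA-512 digest of a field value looks like. It is not the mapper's implementation: the helper `sha512_digest` is hypothetical, and the hex output encoding is an assumption, since the format the plugin actually stores may differ.

```python
import hashlib


def sha512_digest(value: str) -> str:
    """Return the hex-encoded SHA-512 digest of a UTF-8 string."""
    return hashlib.sha512(value.encode("utf-8")).hexdigest()


digest = sha512_digest("secret value")
print(len(digest))  # 128 hex characters, i.e. 512 bits
```

The digest is deterministic, so equal inputs can still be matched by an exact term, while the original value cannot be read back from the digest.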
## Issues
All feedback is welcome! If you find issues, please post them on [GitHub](https://github.com/jprante/elasticsearch-plugin-bundle/issues).
# References
The decompounder is a derived work of the ASV toolbox http://asv.informatik.uni-leipzig.de/asv/methoden
Copyright (C) 2005 Abteilung Automatische Sprachverarbeitung, Institut für Informatik, Universität Leipzig
The Compact Patricia Trie data structure can be found in
*Morrison, D.: Patricia - practical algorithm to retrieve information coded in alphanumeric. Journal of ACM, 1968, 15(4):514–534*
The compound splitter used for generating features for document classification is described in
*Witschel, F., Biemann, C.: Rigorous dimensionality reduction through linguistically motivated feature selection for text categorization. Proceedings of NODALIDA 2005, Joensuu, Finland*
The base form reduction step (for Norwegian) is described in
*Eiken, U.C., Liseth, A.T., Richter, M., Witschel, F. and Biemann, C.: Ord i Dag: Mining Norwegian Daily Newswire. Proceedings of FinTAL, Turku, 2006, Finland*
# License
elasticsearch-plugin-bundle - a compilation of useful plugins for Elasticsearch
Copyright (C) 2014 Jörg Prante
This program is free software: you can redistribute it and/or modify
it under the terms of the GNU Affero General Public License as published by
the Free Software Foundation, either version 3 of the License, or
(at your option) any later version.
This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU Affero General Public License for more details.
You should have received a copy of the GNU Affero General Public License
along with this program. If not, see <http://www.gnu.org/licenses/>.