analysis-sudachi is an Elasticsearch plugin for tokenization of Japanese text using Sudachi, the Japanese morphological analyzer.
Recent changes:
- allow_empty_morpheme is added to the sudachi_tokenizer settings (#151)
- SudachiSplitFilter now works properly with char filters (#149)
Check the changelog for more.
$ ./gradlew -PengineVersion=es:8.15.2 build

For OpenSearch, use the os engine version instead:

$ ./gradlew -PengineVersion=os:2.18.0 build
Move the current directory to $ES_HOME
Install the Plugin
a. Using the release package
$ bin/elasticsearch-plugin install https://github.com/WorksApplications/elasticsearch-sudachi/releases/download/v3.1.1/analysis-sudachi-8.13.4-3.1.1.zip
b. Using a self-built package
$ bin/elasticsearch-plugin install file:///path/to/analysis-sudachi-8.13.4-3.1.1.zip
(Specify the absolute path in URI format)
Download a Sudachi dictionary archive from https://github.com/WorksApplications/SudachiDict
Extract the .dic file and place it at config/sudachi/system_core.dic (you must install system_core.dic at this location if you use Elasticsearch 7.6 or later)
Execute "bin/elasticsearch"
If you want to update the Sudachi that is bundled in a plugin you have installed, do the following:
An analyzer named sudachi is provided. It is equivalent to the following custom analyzer:
{ "settings": { "index": { "analysis": { "analyzer": { "default_sudachi_analyzer": { "type": "custom", "tokenizer": "sudachi_tokenizer", "filter": [ "sudachi_baseform", "sudachi_part_of_speech", "sudachi_ja_stop" ] } } } } } }
See the following sections for details on the tokenizer and each filter.
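As a quick smoke test of the built-in analyzer, you can call the _analyze API directly. This is a minimal sketch; sudachi_sample is a hypothetical index on a node with the plugin and dictionary installed:

POST sudachi_sample/_analyze
{
  "analyzer": "sudachi",
  "text": "関西国際空港"
}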
The sudachi_tokenizer tokenizer tokenizes input text using Sudachi.
The tokenizer reads its Sudachi settings from the file given by settings_path. When additional_settings is specified, the settings in the file at settings_path will be overridden. By default, ES_HOME/config/sudachi/system_core.dic is used as the dictionary; you can specify a different dictionary either in the file given by settings_path or via additional_settings. Due to the security manager, you need to put resources (the settings file, dictionaries, and others) under the Elasticsearch config directory.
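For illustration only: a minimal Sudachi settings file placed at ES_HOME/config/sudachi/sudachi.json might look like the sketch below. It uses just the systemDict and userDict keys that also appear in the additional_settings example under "dictionary settings"; user.dic is a hypothetical user dictionary, and a real settings file may carry more fields:

{
  "systemDict": "system_core.dic",
  "userDict": ["user.dic"]
}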
tokenizer configuration
{ "settings": { "index": { "analysis": { "tokenizer": { "sudachi_tokenizer": { "type": "sudachi_tokenizer", "split_mode": "C", "discard_punctuation": true, "resources_path": "/etc/elasticsearch/config/sudachi" } }, "analyzer": { "sudachi_analyzer": { "type": "custom", "tokenizer": "sudachi_tokenizer" } } } } } }
dictionary settings
{ "settings": { "index": { "analysis": { "tokenizer": { "sudachi_tokenizer": { "type": "sudachi_tokenizer", "additional_settings": "{\"systemDict\":\"system_full.dic\",\"userDict\":[\"user.dic\"]}" } }, "analyzer": { "sudachi_analyzer": { "type": "custom", "tokenizer": "sudachi_tokenizer" } } } } } }
The sudachi_split token filter works like the mode option of kuromoji: it splits compound tokens into their constituent subwords while keeping the original token.
Note: in a search query, split subwords are handled as a phrase (in the same way as multi-word synonyms). If you want to search with both A-unit and C-unit tokens, use multiple tokenizers instead, as sketched after the example below.
{ "settings": { "index": { "analysis": { "tokenizer": { "sudachi_tokenizer": { "type": "sudachi_tokenizer" } }, "analyzer": { "sudachi_analyzer": { "filter": ["my_searchfilter"], "tokenizer": "sudachi_tokenizer", "type": "custom" } }, "filter":{ "my_searchfilter": { "type": "sudachi_split", "mode": "search" } } } } } }POST sudachi_sample/_analyze
{ "analyzer": "sudachi_analyzer", "text": "関西国際空港" }
Which responds with:
{ "tokens" : [ { "token" : "関西国際空港", "start_offset" : 0, "end_offset" : 6, "type" : "word", "position" : 0, "positionLength" : 3 }, { "token" : "関西", "start_offset" : 0, "end_offset" : 2, "type" : "word", "position" : 0 }, { "token" : "国際", "start_offset" : 2, "end_offset" : 4, "type" : "word", "position" : 1 }, { "token" : "空港", "start_offset" : 4, "end_offset" : 6, "type" : "word", "position" : 2 } ] }
The sudachi_part_of_speech token filter removes tokens that match a set of part-of-speech tags. It accepts the following setting:
stoptags: an array of part-of-speech and/or inflection tags that should be removed. It defaults to the stoptags.txt file embedded in lucene-analysis-sudachi.jar.
Sudachi POS information is a CSV list consisting of six items:
1-4. part-of-speech hierarchy (品詞階層)
5. inflectional type (活用型)
6. inflectional form (活用形)
With stoptags, you can filter out tokens using any of these forward-matching forms:
名詞
名詞,固有名詞
名詞,固有名詞,地名
名詞,固有名詞,地名,一般
五段-カ行
終止形-一般
五段-カ行,終止形-一般
{ "settings": { "index": { "analysis": { "tokenizer": { "sudachi_tokenizer": { "type": "sudachi_tokenizer" } }, "analyzer": { "sudachi_analyzer": { "filter": ["my_posfilter"], "tokenizer": "sudachi_tokenizer", "type": "custom" } }, "filter":{ "my_posfilter":{ "type":"sudachi_part_of_speech", "stoptags":[ "助詞", "助動詞", "補助記号,句点", "補助記号,読点" ] } } } } } }POST sudachi_sample/_analyze
{ "analyzer": "sudachi_analyzer", "text": "寿司がおいしいね" }
Which responds with:
{ "tokens": [ { "token": "寿司", "start_offset": 0, "end_offset": 2, "type": "word", "position": 0 }, { "token": "おいしい", "start_offset": 3, "end_offset": 7, "type": "word", "position": 2 } ] }
The sudachi_ja_stop token filter filters out Japanese stopwords (_japanese_), and any other custom stopwords specified by the user. This filter only supports the predefined _japanese_ stopwords list. If you want to use a different predefined list, use the stop token filter instead.
{ "settings": { "index": { "analysis": { "tokenizer": { "sudachi_tokenizer": { "type": "sudachi_tokenizer" } }, "analyzer": { "sudachi_analyzer": { "filter": ["my_stopfilter"], "tokenizer": "sudachi_tokenizer", "type": "custom" } }, "filter":{ "my_stopfilter":{ "type":"sudachi_ja_stop", "stopwords":[ "_japanese_", "は", "です" ] } } } } } }POST sudachi_sample/_analyze
{ "analyzer": "sudachi_analyzer", "text": "私は宇宙人です。" }
Which responds with:
{ "tokens": [ { "token": "私", "start_offset": 0, "end_offset": 1, "type": "word", "position": 0 }, { "token": "宇宙", "start_offset": 2, "end_offset": 4, "type": "word", "position": 2 }, { "token": "人", "start_offset": 4, "end_offset": 5, "type": "word", "position": 3 } ] }
The sudachi_baseform token filter replaces terms with their Sudachi dictionary form. This acts as a lemmatizer for verbs and adjectives.
This will be overridden by the sudachi_split, sudachi_normalizedform, or sudachi_readingform token filters.
{ "settings": { "index": { "analysis": { "tokenizer": { "sudachi_tokenizer": { "type": "sudachi_tokenizer" } }, "analyzer": { "sudachi_analyzer": { "filter": ["sudachi_baseform"], "tokenizer": "sudachi_tokenizer", "type": "custom" } } } } } }POST sudachi_sample/_analyze
{ "analyzer": "sudachi_analyzer", "text": "飲み" }
Which responds with:
{ "tokens": [ { "token": "飲む", "start_offset": 0, "end_offset": 2, "type": "word", "position": 0 } ] }
The sudachi_normalizedform token filter replaces terms with their Sudachi normalized form. This acts as a normalizer for spelling variants. This filter lemmatizes verbs and adjectives too, so you don't need to use the sudachi_baseform filter together with it.
This will be overridden by the sudachi_split, sudachi_baseform, or sudachi_readingform token filters.
{ "settings": { "index": { "analysis": { "tokenizer": { "sudachi_tokenizer": { "type": "sudachi_tokenizer" } }, "analyzer": { "sudachi_analyzer": { "filter": ["sudachi_normalizedform"], "tokenizer": "sudachi_tokenizer", "type": "custom" } } } } } }POST sudachi_sample/_analyze
{ "analyzer": "sudachi_analyzer", "text": "呑み" }
Which responds with:
{ "tokens": [ { "token": "飲む", "start_offset": 0, "end_offset": 2, "type": "word", "position": 0 } ] }
The sudachi_readingform token filter replaces terms with their reading form, in either katakana or romaji.
This will be overridden by the sudachi_split, sudachi_baseform, or sudachi_normalizedform token filters.
It accepts the following setting:
use_romaji: whether to output the reading in romaji instead of katakana; defaults to false (katakana).
{ "settings": { "index": { "analysis": { "filter": { "romaji_readingform": { "type": "sudachi_readingform", "use_romaji": true }, "katakana_readingform": { "type": "sudachi_readingform", "use_romaji": false } }, "tokenizer": { "sudachi_tokenizer": { "type": "sudachi_tokenizer" } }, "analyzer": { "romaji_analyzer": { "tokenizer": "sudachi_tokenizer", "filter": ["romaji_readingform"] }, "katakana_analyzer": { "tokenizer": "sudachi_tokenizer", "filter": ["katakana_readingform"] } } } } } }POST sudachi_sample/_analyze
{ "analyzer": "katakana_analyzer", "text": "寿司" }
Returns スシ.
{ "analyzer": "romaji_analyzer", "text": "寿司" }
Returns susi
.
There is a temporary way to use the Sudachi synonym dictionary (Sudachi 同義語辞書) with Elasticsearch.
Please refer to this document for details.
Copyright (c) 2017-2024 Works Applications Co., Ltd.
Originally under elasticsearch, https://www.elastic.co/jp/products/elasticsearch
Originally under lucene, https://lucene.apache.org/