SudachiPy is a Python version of Sudachi, a Japanese morphological analyzer.
This repository is for SudachiPy 0.5.*; versions 0.6.* and above are developed as part of Sudachi.rs.
$ pip install sudachipy sudachidict_core

$ echo "高輪ゲートウェイ駅" | sudachipy
高輪ゲートウェイ駅	名詞,固有名詞,一般,*,*,*	高輪ゲートウェイ駅
EOS

$ echo "高輪ゲートウェイ駅" | sudachipy -m A
高輪	名詞,固有名詞,地名,一般,*,*	高輪
ゲートウェイ	名詞,普通名詞,一般,*,*,*	ゲートウェー
駅	名詞,普通名詞,一般,*,*,*	駅
EOS

$ echo "空缶空罐空きカン" | sudachipy -a
空缶	名詞,普通名詞,一般,*,*,*	空き缶	空缶	アキカン	0
空罐	名詞,普通名詞,一般,*,*,*	空き缶	空罐	アキカン	0
空きカン	名詞,普通名詞,一般,*,*,*	空き缶	空きカン	アキカン	0
EOS
You need SudachiPy and a dictionary.
Step 1. Install SudachiPy
$ pip install sudachipy
Step 2. Get a dictionary
You can get a dictionary as a Python package. It may take a while to download the dictionary file (around 70MB for the core edition).
$ pip install sudachidict_core
Alternatively, you can choose another dictionary edition. See the description of the dictionary editions below for details.
SudachiPy provides a CLI command, sudachipy.
$ echo "外国人参政権" | sudachipy
外国人参政権	名詞,普通名詞,一般,*,*,*	外国人参政権
EOS

$ echo "外国人参政権" | sudachipy -m A
外国	名詞,普通名詞,一般,*,*,*	外国
人	接尾辞,名詞的,一般,*,*,*	人
参政	名詞,普通名詞,一般,*,*,*	参政
権	接尾辞,名詞的,一般,*,*,*	権
EOS
$ sudachipy tokenize -h
usage: sudachipy tokenize [-h] [-r file] [-m {A,B,C}] [-o file] [-s string]
                          [-a] [-d] [-v]
                          [file [file ...]]

Tokenize Text

positional arguments:
  file           text written in utf-8

optional arguments:
  -h, --help     show this help message and exit
  -r file        the setting file in JSON format
  -m {A,B,C}     the mode of splitting
  -o file        the output file
  -s string      sudachidict type
  -a             print all of the fields
  -d             print the debug information
  -v, --version  print sudachipy version
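If you call the CLI from a script, it can help to assemble the argument list in one place. The following is a minimal sketch that uses only the options listed in the help text above; `build_tokenize_command` and its parameter names are hypothetical helpers introduced for illustration, not part of SudachiPy.

```python
from typing import List, Optional

# Split modes accepted by the -m option, per the help text.
VALID_MODES = ("A", "B", "C")


def build_tokenize_command(
    mode: Optional[str] = None,
    dict_type: Optional[str] = None,
    all_fields: bool = False,
    settings: Optional[str] = None,
    output: Optional[str] = None,
    inputs: Optional[List[str]] = None,
) -> List[str]:
    """Return an argv list for `sudachipy tokenize` (hypothetical helper)."""
    if mode is not None and mode not in VALID_MODES:
        raise ValueError(f"mode must be one of {VALID_MODES}, got {mode!r}")
    argv = ["sudachipy", "tokenize"]
    if settings is not None:
        argv += ["-r", settings]   # setting file in JSON format
    if mode is not None:
        argv += ["-m", mode]       # mode of splitting
    if output is not None:
        argv += ["-o", output]     # output file
    if dict_type is not None:
        argv += ["-s", dict_type]  # sudachidict type
    if all_fields:
        argv.append("-a")          # print all of the fields
    argv += list(inputs or [])     # input files (stdin is read when empty)
    return argv


print(build_tokenize_command(mode="A", all_fields=True))
# prints ['sudachipy', 'tokenize', '-m', 'A', '-a']
```

The resulting list can be passed to `subprocess.run(argv, input=text, text=True, capture_output=True)` on a machine where SudachiPy and a dictionary are installed.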
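Columns are tab separated: surface, part-of-speech tags (comma separated), and normalized form.
When you add the -a option, it additionally outputs the dictionary form, the reading form, and the dictionary ID. The dictionary ID is 0 for the system dictionary, 1 and above for the user dictionaries, and -1\t(OOV) if a word is Out-of-Vocabulary (not in the dictionary).
$ echo "外国人参政権" | sudachipy -a
外国人参政権	名詞,普通名詞,一般,*,*,*	外国人参政権	外国人参政権	ガイコクジンサンセイケン	0
EOS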
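The tab-separated output can be read back into structured data with a few lines of Python. This is a minimal sketch, assuming the column order seen in the examples (surface, part-of-speech tags, normalized form, and with -a also dictionary form, reading form, and dictionary ID) and that each sentence ends with an `EOS` line. The function name and dictionary keys are our own labels, not SudachiPy API.

```python
from typing import Any, Dict, List


def parse_sudachipy_output(text: str) -> List[List[Dict[str, Any]]]:
    """Parse `sudachipy` CLI output into one token list per sentence."""
    sentences: List[List[Dict[str, Any]]] = []
    current: List[Dict[str, Any]] = []
    for line in text.splitlines():
        if line == "EOS":          # sentence terminator printed by the CLI
            sentences.append(current)
            current = []
            continue
        if not line:
            continue
        fields = line.split("\t")  # columns are tab separated
        if len(fields) < 3:
            raise ValueError(f"unexpected line: {line!r}")
        token: Dict[str, Any] = {
            "surface": fields[0],
            "pos": fields[1].split(","),  # POS tags are comma separated
            "normalized_form": fields[2],
        }
        if len(fields) >= 6:       # extra columns printed with -a
            token["dictionary_form"] = fields[3]
            token["reading_form"] = fields[4]
            token["dictionary_id"] = int(fields[5])
        current.append(token)
    if current:                    # tolerate output without a final EOS
        sentences.append(current)
    return sentences


sample = (
    "外国人参政権\t名詞,普通名詞,一般,*,*,*\t外国人参政権"
    "\t外国人参政権\tガイコクジンサンセイケン\t0\nEOS\n"
)
sentences = parse_sudachipy_output(sample)
print(sentences[0][0]["reading_form"])   # prints ガイコクジンサンセイケン
print(sentences[0][0]["dictionary_id"])  # prints 0
```

A dictionary ID of 0 marks a system dictionary entry, 1 and above a user dictionary entry, and -1 an out-of-vocabulary word.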
$ echo "阿quei" | sudachipy -a
阿	名詞,普通名詞,一般,*,*,*	阿	阿		-1	(OOV)
quei	名詞,普通名詞,一般,*,*,*	quei	quei		-1	(OOV)
EOS
Usage: As a Python package
Here is an example:
from sudachipy import tokenizer
from sudachipy import dictionary

tokenizer_obj = dictionary.Dictionary().create()

# Multi-granular Tokenization
mode = tokenizer.Tokenizer.SplitMode.C
[m.surface() for m in tokenizer_obj.tokenize("国家公務員", mode)]
# => ['国家公務員']

mode = tokenizer.Tokenizer.SplitMode.B
[m.surface() for m in tokenizer_obj.tokenize("国家公務員", mode)]
# => ['国家', '公務員']

mode = tokenizer.Tokenizer.SplitMode.A
[m.surface() for m in tokenizer_obj.tokenize("国家公務員", mode)]
# => ['国家', '公務', '員']

# Morpheme information
m = tokenizer_obj.tokenize("食べ", mode)[0]
m.surface()          # => '食べ'
m.dictionary_form()  # => '食べる'
m.reading_form()     # => 'タベ'
m.part_of_speech()   # => ['動詞', '一般', '*', '*', '下一段-バ行', '連用形-一般']

# Normalization
tokenizer_obj.tokenize("附属", mode)[0].normalized_form()
# => '付属'
tokenizer_obj.tokenize("SUMMER", mode)[0].normalized_form()
# => 'サマー'
tokenizer_obj.tokenize("シュミレーション", mode)[0].normalized_form()
# => 'シミュレーション'
(With the 20200330 core dictionary. The results may change when you use other versions.)
WARNING: sudachipy link is no longer available in SudachiPy v0.5.2 and later.
There are three editions of Sudachi Dictionary, namely, small, core, and full. See WorksApplications/SudachiDict for details.
SudachiPy uses sudachidict_core by default.
Dictionaries are installed as the Python packages sudachidict_small, sudachidict_core, and sudachidict_full.
The dictionary files are not included in the packages themselves; they are downloaded upon installation.
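The mapping from edition name to package name can be sketched in a few lines. This is an illustration only: the helper names are hypothetical, and the check assumes each dictionary package is importable under the same name as the package (for example `sudachidict_core`).

```python
import importlib.util
from typing import List

# The three editions of Sudachi Dictionary named above.
EDITIONS = ("small", "core", "full")


def dictionary_package(edition: str = "core") -> str:
    """Return the Python package name for a dictionary edition."""
    if edition not in EDITIONS:
        raise ValueError(f"edition must be one of {EDITIONS}, got {edition!r}")
    return f"sudachidict_{edition}"


def installed_editions() -> List[str]:
    """List the editions whose package can be found in this environment.

    Assumption: the importable module name equals the package name.
    """
    return [
        edition
        for edition in EDITIONS
        if importlib.util.find_spec(dictionary_package(edition)) is not None
    ]


print(dictionary_package("full"))  # prints sudachidict_full
print(installed_editions())        # depends on what is installed locally
```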
Dictionary option: command line
You can specify the dictionary with the tokenize option -s.
$ pip install sudachidict_small
$ echo "外国人参政権" | sudachipy -s small
$ pip install sudachidict_full
$ echo "外国人参政権" | sudachipy -s full
Dictionary option: Python package
You can specify the dictionary with the Dictionary() arguments config_path or dict_type.
class Dictionary(config_path=None, resource_dir=None, dict_type=None)
config_path: You can specify the path to the setting file with config_path (see Dictionary in The Setting File for details). If the dictionary file is specified in the setting file as systemDict, SudachiPy will use that dictionary.
dict_type: You can also specify the dictionary type directly with dict_type. The available values are small, core, or full. If different dictionaries are specified with config_path and dict_type, the dictionary given by dict_type overrides the one defined in the config path.
from sudachipy import tokenizer
from sudachipy import dictionary

# default: sudachidict_core
tokenizer_obj = dictionary.Dictionary().create()

# The dictionary given by the `systemDict` key in the config file (/path/to/sudachi.json) will be used
tokenizer_obj = dictionary.Dictionary(config_path="/path/to/sudachi.json").create()

# The dictionary specified by `dict_type` will be set.
tokenizer_obj = dictionary.Dictionary(dict_type="core").create()   # sudachidict_core (same as default)
tokenizer_obj = dictionary.Dictionary(dict_type="small").create()  # sudachidict_small
tokenizer_obj = dictionary.Dictionary(dict_type="full").create()   # sudachidict_full

# The dictionary specified by `dict_type` overrides those defined in the config path.
# In the following code, `sudachidict_full` will be used regardless of a dictionary defined in the config file.
tokenizer_obj = dictionary.Dictionary(config_path="/path/to/sudachi.json", dict_type="full").create()
Dictionary in The Setting File
Alternatively, if the dictionary file is specified in the setting file, sudachi.json, SudachiPy will use that file.
{
"systemDict" : "relative/path/to/system.dic",
...
}
The default setting file is sudachipy/resources/sudachi.json. You can specify your own sudachi.json with the -r option.
$ sudachipy -r path/to/sudachi.json
To use a user dictionary, user.dic, place sudachi.json anywhere you like, and add a userDict value with the relative path from sudachi.json to your user.dic.
{
"userDict" : ["relative/path/to/user.dic"],
...
}
Then specify your sudachi.json with the -r option.
$ sudachipy -r path/to/sudachi.json
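Because userDict takes paths relative to sudachi.json, computing them by hand is easy to get wrong. The sketch below starts from an existing setting file (for example, a copy of the default one), adds a userDict list with paths relative to the new file's location, and writes the result. `add_user_dictionaries` and its parameters are hypothetical names for illustration; the demo uses a stand-in base file with a single key rather than the real default settings.

```python
import json
import os
import tempfile
from typing import Any, Dict, Iterable


def add_user_dictionaries(
    base_settings_path: str,
    output_settings_path: str,
    user_dic_paths: Iterable[str],
) -> Dict[str, Any]:
    """Copy a setting file, adding `userDict` paths relative to the new file."""
    with open(base_settings_path, encoding="utf-8") as f:
        settings = json.load(f)
    settings_dir = os.path.dirname(os.path.abspath(output_settings_path))
    settings["userDict"] = [
        # Relative path from the directory of sudachi.json to each user.dic,
        # written with forward slashes.
        os.path.relpath(os.path.abspath(p), start=settings_dir).replace(os.sep, "/")
        for p in user_dic_paths
    ]
    with open(output_settings_path, "w", encoding="utf-8") as f:
        json.dump(settings, f, ensure_ascii=False, indent=4)
    return settings


# Demo in a temporary directory with a stand-in base setting file.
with tempfile.TemporaryDirectory() as tmp:
    base = os.path.join(tmp, "base.json")
    with open(base, "w", encoding="utf-8") as f:
        json.dump({"systemDict": "relative/path/to/system.dic"}, f)
    out = os.path.join(tmp, "conf", "sudachi.json")
    os.makedirs(os.path.dirname(out))
    result = add_user_dictionaries(base, out, [os.path.join(tmp, "dicts", "user.dic")])
    print(result["userDict"])  # prints ['../dicts/user.dic']
```

The written file can then be passed to the CLI with the -r option.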
You can build a user dictionary with the subcommand ubuild.
WARNING: ubuild in v0.3.* contains a bug.
$ sudachipy ubuild -h
usage: sudachipy ubuild [-h] [-d string] [-o file] [-s file] file [file ...]

Build User Dictionary

positional arguments:
  file        source files with CSV format (one or more)

optional arguments:
  -h, --help  show this help message and exit
  -d string   description comment to be embedded on dictionary
  -o file     output file (default: user.dic)
  -s file     system dictionary path (default: system core dictionary path)
For the dictionary file format, please refer to this document (written in Japanese; an English version is not available yet).
Customized System Dictionary
$ sudachipy build -h
usage: sudachipy build [-h] [-o file] [-d string] -m file file [file ...]

Build Sudachi Dictionary

positional arguments:
  file        source files with CSV format (one of more)

optional arguments:
  -h, --help  show this help message and exit
  -o file     output file (default: system.dic)
  -d string   description comment to be embedded on dictionary

required named arguments:
  -m file     connection matrix file with MeCab's matrix.def format
To use your customized system.dic, place sudachi.json anywhere you like, and overwrite the systemDict value with the relative path from sudachi.json to your system.dic.
{
"systemDict" : "relative/path/to/system.dic",
...
}
Then specify your sudachi.json with the -r option.
$ sudachipy -r path/to/sudachi.json
$ python setup.py build_ext --inplace
Run scripts/format.sh to check if your code is formatted correctly.
You need the packages flake8, flake8-import-order, and flake8-builtins (see requirements.txt).
Run scripts/test.sh to run the tests.
Sudachi and SudachiPy are developed by WAP Tokushima Laboratory of AI and NLP.
Open an issue, or come to our Slack workspace for questions and discussion.
https://sudachi-dev.slack.com/ (Get invitation here)
Enjoy tokenization!