Thank you so much for making this wonderful package! I have successfully replicated the sample estimations shown on the package homepage. However, when I ran keyATM_read() on Chinese text data, I got garbled characters in my environment. I checked the text encoding with as_utf8() and utf8_valid(), but the problem seems to arise inside keyATM_read().
I regularly use quanteda with the same Chinese data and have never had this trouble. I also found that another user does not face this issue when using the same text data and running the same code, so I suspect the problem lies in my environment. I guess this is primarily an encoding issue that I should resolve myself, but let me post it here. For your information, my default text encoding in R is UTF-8 (checked under Tools -> Code -> Saving in R).
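The editor setting mentioned above only controls how script files are saved. The encoding that the R session itself treats as native is a separate setting, and it can be inspected with base R. The snippet below is a minimal sketch of that check, not output from my machine. The variable names are only illustrative.

```r
# Locale-related settings of the running session (base R only).
ctype <- Sys.getlocale("LC_CTYPE")  # character-type locale of the session
info  <- l10n_info()                # list with elements such as MBCS and UTF-8

cat("LC_CTYPE:", ctype, "\n")
cat("Native encoding is UTF-8:", info[["UTF-8"]], "\n")

# A string built from \u escapes is marked as UTF-8 regardless of the locale.
x <- "\u4e2d\u56fd"
cat("Declared encoding of x:", Encoding(x), "\n")
```

If the second line prints FALSE, strings marked as UTF-8 may still be converted to the native code page by code that does not preserve the encoding mark.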
Please let me know if you need more details about my settings, script, and/or data.
Thank you!
Environment information (required):
sessionInfo()
R version 4.1.2 (2021-11-01)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 22000)
Matrix products: default
locale:
[1] LC_COLLATE=Japanese_Japan.932 LC_CTYPE=Japanese_Japan.932
[3] LC_MONETARY=Japanese_Japan.932 LC_NUMERIC=C
[5] LC_TIME=Japanese_Japan.932
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] showtext_0.9-5 showtextdb_3.0 sysfonts_0.8.8 utf8_1.2.2
[5] lubridate_1.8.0 jiebaR_0.11 jiebaRD_0.1 readxl_1.4.1
[9] forcats_0.5.2 stringr_1.4.1 dplyr_1.0.10 purrr_0.3.4
[13] readr_2.1.3 tidyr_1.2.1 tibble_3.1.8 ggplot2_3.4.0
[17] tidyverse_1.3.2 keyATM_0.4.1 quanteda_3.2.3
loaded via a namespace (and not attached):
[1] Rcpp_1.0.9 lattice_0.20-45 assertthat_0.2.1
[4] R6_2.5.1 cellranger_1.1.0 backports_1.4.1
[7] reprex_2.0.2 httr_1.4.4 pillar_1.8.1
[10] rlang_1.0.6 googlesheets4_1.0.1 rstudioapi_0.14
[13] quanteda.corpora_0.9.2 Matrix_1.4-1 textshaping_0.3.6
[16] googledrive_2.0.0 munsell_0.5.0 broom_1.0.1
[19] compiler_4.1.2 modelr_0.1.9 systemfonts_1.0.4
[22] pkgconfig_2.0.3 tidyselect_1.2.0 fansi_1.0.3
[25] crayon_1.5.2 tzdb_0.3.0 dbplyr_2.2.1
[28] withr_2.5.0 grid_4.1.2 jsonlite_1.8.2
[31] gtable_0.3.1 lifecycle_1.0.3 DBI_1.1.3
[34] magrittr_2.0.3 scales_1.2.1 RcppParallel_5.1.5
[37] cli_3.4.1 stringi_1.7.6 fs_1.5.2
[40] xml2_1.3.3 ragg_1.2.2 ellipsis_0.3.2
[43] stopwords_2.3 generics_0.1.3 vctrs_0.5.1
[46] fastmatch_1.1-3 RColorBrewer_1.1-3 tools_4.1.2
[49] glue_1.6.2 hms_1.1.2 colorspace_2.0-3
[52] gargle_1.2.1 rvest_1.0.3 haven_2.5.1
Actual output (required):
Checking the wd_names object in keyATM_docs after running keyATM_read(), I found garbled characters. The dfm is constructed with quanteda, and there are no garbled characters before keyATM_read(): I checked topfeatures(dfm, 100) immediately before the call and found no issue.
# Please copy and paste the error message
> keyATM_docs <- keyATM_read(dfm,
+ encoding = "UTF-8",
+ check = TRUE,
+ keep_docnames = FALSE,
+ progress_bar = FALSE,
+ split = 0)
Using quanteda dfm.
> summary(keyATM_docs)
keyATM_docs object of: 983 documents.
Length of documents:
Avg: 542.942
Min: 48
Max: 4706
SD: 551.392
Number of unique words: 12616
> keyATM_docs[["wd_names"]]
[1] "蝮壼ョ壻ク咲ァサ" "襍ー"
[3] "荳ュ蝗ス迚ケ濶イ遉セ莨壻クサ荵<89>" "豕墓イサ"
[5] "驕楢キッ" "蜈ィ髱「"
[7] "謗ィ霑<9b>" "萓晄ウ墓イサ蝗ス"
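The pattern above looks like UTF-8 bytes being reinterpreted in the session's native code page (CP932, per the locale reported earlier). This is only my assumption and is not confirmed for keyATM_read() itself. The following base-R sketch simulates that misreading on a short Chinese string and then reverses it. It assumes the local iconv build recognizes the name "CP932".

```r
# Assumption: the garbling comes from UTF-8 bytes being read as CP932.
x <- "\u4e2d\u56fd"  # two Chinese characters, written as escapes to avoid source-encoding issues
utf8_bytes <- charToRaw(enc2utf8(x))

# Drop the UTF-8 mark and convert as if the bytes were CP932 text.
garbled <- iconv(rawToChar(utf8_bytes), from = "CP932", to = "UTF-8")
print(garbled)

# Reverse the damage: encode the garbled text back to CP932 bytes,
# then declare those bytes to be UTF-8 again.
restored <- iconv(garbled, from = "UTF-8", to = "CP932")
Encoding(restored) <- "UTF-8"
print(identical(charToRaw(restored), utf8_bytes))
```

The garbled string can be compared with the wd_names entries above. Entries that contain raw byte markers such as <89> or <9b> include bytes that are not valid in CP932, so this reversal would probably not recover them losslessly.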
Because wd_names is garbled, visualize_keywords() does not work.
> key_viz <- visualize_keywords(docs = keyATM_docs, keywords = keywords)
Warning in check_keywords(unique(unlisted), keywords, prune) :
Keywords will be pruned because they do not appear in documents: <U+7ECF><U+6D4E>, 商<U+4E1A>, <U+519B>
Error in check_keywords(unique(unlisted), keywords, prune) :
All keywords are pruned. Please check: econ, military
Expected output (it helps us a lot):
I expect to get a basic result like the one shown on the Preparation page (https://keyatm.github.io/keyATM/articles/pkgdown_files/Preparation.html).
Thank you for your consideration!