Showing content from https://github.com/keyATM/keyATM/issues/189 below:

Garbled characters issue during keyATM_read() · Issue #189 · keyATM/keyATM · GitHub

Problem summary (required):

Thank you so much for making this wonderful package! I have successfully replicated the sample estimations shown on the package homepage. However, when I ran keyATM_read() on Chinese text data, garbled characters appeared in my environment. I checked the text encoding with as_utf8() and utf8_valid(), but the problem seems to occur during keyATM_read().

I regularly use quanteda with the same Chinese data and have never had this trouble. I also found that another user, using the same text data and running the same code, does not face this issue, so I suspect something in my environment is the cause. I guess this is primarily an encoding issue that I should resolve myself, but let me post it here. For your information, my default text encoding in R is UTF-8 (checked under Tools -> Code -> Saving in R).
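The environment checks described above can be sketched roughly as follows. This uses only base R (validUTF8(), Encoding(), Sys.getlocale(), l10n_info()) as stand-ins for the utf8 package functions named earlier, and the sample vector `words` is hypothetical.

```r
# Hypothetical sample: two Chinese words written as Unicode escapes, so the
# script does not depend on the encoding of the source file.
words <- c("\u7ecf\u6d4e", "\u5546\u4e1a")

# Are the strings valid UTF-8, and how does R mark them internally?
valid <- validUTF8(words)
marks <- Encoding(words)
print(data.frame(valid = valid, mark = marks))

# What does the session treat as its native encoding?
print(Sys.getlocale("LC_CTYPE"))

# TRUE only when the session locale itself is UTF-8.
print(l10n_info()[["UTF-8"]])
```

A string can be valid UTF-8 while the session locale is not UTF-8, so both checks are worth reporting together.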

Please let me know if you need more details about my settings, script, and/or data.

Thank you!

Environment information (required):

sessionInfo()  # please run this in R and copy&paste the output
R version 4.1.2 (2021-11-01)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 22000)

Matrix products: default

locale:
[1] LC_COLLATE=Japanese_Japan.932  LC_CTYPE=Japanese_Japan.932  
[3] LC_MONETARY=Japanese_Japan.932 LC_NUMERIC=C                  
[5] LC_TIME=Japanese_Japan.932    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base    

other attached packages:
 [1] showtext_0.9-5  showtextdb_3.0  sysfonts_0.8.8  utf8_1.2.2    
 [5] lubridate_1.8.0 jiebaR_0.11     jiebaRD_0.1     readxl_1.4.1  
 [9] forcats_0.5.2   stringr_1.4.1   dplyr_1.0.10    purrr_0.3.4    
[13] readr_2.1.3     tidyr_1.2.1     tibble_3.1.8    ggplot2_3.4.0  
[17] tidyverse_1.3.2 keyATM_0.4.1    quanteda_3.2.3

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.9             lattice_0.20-45        assertthat_0.2.1      
 [4] R6_2.5.1               cellranger_1.1.0       backports_1.4.1      
 [7] reprex_2.0.2           httr_1.4.4             pillar_1.8.1          
[10] rlang_1.0.6            googlesheets4_1.0.1    rstudioapi_0.14      
[13] quanteda.corpora_0.9.2 Matrix_1.4-1           textshaping_0.3.6    
[16] googledrive_2.0.0      munsell_0.5.0          broom_1.0.1          
[19] compiler_4.1.2         modelr_0.1.9           systemfonts_1.0.4    
[22] pkgconfig_2.0.3        tidyselect_1.2.0       fansi_1.0.3          
[25] crayon_1.5.2           tzdb_0.3.0             dbplyr_2.2.1          
[28] withr_2.5.0            grid_4.1.2             jsonlite_1.8.2        
[31] gtable_0.3.1           lifecycle_1.0.3        DBI_1.1.3            
[34] magrittr_2.0.3         scales_1.2.1           RcppParallel_5.1.5    
[37] cli_3.4.1              stringi_1.7.6          fs_1.5.2              
[40] xml2_1.3.3             ragg_1.2.2             ellipsis_0.3.2        
[43] stopwords_2.3          generics_0.1.3         vctrs_0.5.1          
[46] fastmatch_1.1-3        RColorBrewer_1.1-3     tools_4.1.2          
[49] glue_1.6.2             hms_1.1.2              colorspace_2.0-3      
[52] gargle_1.2.1           rvest_1.0.3            haven_2.5.1

Actual output (required):

Inspecting the wd_names object in keyATM_docs after running keyATM_read(), I found garbled characters. The dfm is constructed with quanteda, and there are no garbled characters before keyATM_read(): I checked topfeatures(dfm, 100) just before calling keyATM_read() and found no issue.

# Please copy and paste the error message
> keyATM_docs <- keyATM_read(dfm,
+                            encoding = "UTF-8",
+                            check = TRUE,
+                            keep_docnames = FALSE,
+                            progress_bar = FALSE,
+                            split = 0)
Using quanteda dfm.
 
> summary(keyATM_docs)
keyATM_docs object of: 983 documents.
Length of documents:
  Avg: 542.942
  Min: 48
  Max: 4706
   SD: 551.392
Number of unique words: 12616

keyATM_docs[["wd_names"]]
   [1] "蝮壼ｮ壻ｸ咲ｧｻ"                   "襍ｰ"                          
   [3] "荳ｭ蝗ｽ迚ｹ濶ｲ遉ｾ莨壻ｸｻ荵<89>"    "豕墓ｲｻ"                        
   [5] "驕楢ｷｯ"                         "蜈ｨ髱｢"                        
   [7] "謗ｨ霑<9b>"                      "萓晄ｳ墓ｲｻ蝗ｽ"
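The garbled entries above look like what results when UTF-8 bytes are read as CP932, the code page listed in the locale section. The sketch below reproduces that pattern for a single character with base R's iconv(). Whether this is what actually happens inside keyATM_read() is an assumption, not something confirmed in this report.

```r
# A single Chinese character, written as a Unicode escape (U+8D70).
original <- "\u8d70"

# Deliberately misread its UTF-8 bytes as CP932 (the assumed cause of the garbling).
garbled <- iconv(original, from = "CP932", to = "UTF-8")
print(garbled)

# The damage is reversible as long as no bytes were lost: encode the garbled
# text back to CP932 bytes, then relabel those bytes as UTF-8.
restored <- iconv(garbled, from = "UTF-8", to = "CP932")
Encoding(restored) <- "UTF-8"
print(identical(charToRaw(restored), charToRaw(original)))
```

Entries that contain raw byte markers such as <89> would not round-trip this simply, because part of a multi-byte sequence was left undecoded.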

Because wd_names is garbled, visualize_keywords() does not work.

> key_viz <- visualize_keywords(docs = keyATM_docs, keywords = keywords)
Warning in check_keywords(unique(unlisted), keywords, prune) :
  Keywords will be pruned because they do not appear in documents: <U+7ECF><U+6D4E>, 商<U+4E1A>, <U+519B>
Error in check_keywords(unique(unlisted), keywords, prune) :
  All keywords are pruned. Please check: econ, military
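The <U+7ECF>-style escapes in the warning are consistent with characters that have no representation in the session's native encoding. A minimal sketch, assuming CP932 is that encoding (as the locale section suggests) and using base R's iconv() rather than anything from keyATM:

```r
# Three characters taken from the keywords in the warning, as Unicode escapes:
# U+7ECF, U+5546, U+519B.
chars <- c("\u7ecf", "\u5546", "\u519b")

# Converting to CP932 yields NA for characters the code page cannot represent.
in_cp932 <- !is.na(iconv(chars, from = "UTF-8", to = "CP932"))
print(in_cp932)

# With sub = "Unicode", iconv() substitutes <U+xxxx> for such characters,
# which resembles the escapes seen in the warning message.
escaped <- iconv(chars[!in_cp932], from = "UTF-8", to = "CP932", sub = "Unicode")
print(escaped)
```

The middle character also exists in the Japanese character set, which would explain why the warning shows it unescaped next to an escaped neighbour.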

Expected output (it helps us a lot):

I hope to get a basic result like the one shown on the Preparation page (https://keyatm.github.io/keyATM/articles/pkgdown_files/Preparation.html).

Thank you for your consideration!
