The wordvector package creates word and document vectors from quanteda objects. It currently supports word2vec (Mikolov et al., 2013) and latent semantic analysis (Deerwester et al., 1990).
wordvector is available on CRAN.

```r
install.packages("wordvector")
```
The latest version is available on GitHub.

```r
remotes::install_github("koheiw/wordvector")
```
In this example, we train a word2vec model on a corpus of news summaries collected from Yahoo News via RSS between 2012 and 2016.
```r
# Download data
download.file('https://www.dropbox.com/s/e19kslwhuu9yc2z/yahoo-news.RDS?dl=1',
              '~/yahoo-news.RDS', mode = "wb")
```
```r
library(wordvector)
library(quanteda)
## Package version: 4.2.1
## Unicode version: 15.1
## ICU version: 74.1
## Parallel computing: 16 of 16 threads used.
## See https://quanteda.io for tutorials and examples.

# Load data
dat <- readRDS('~/yahoo-news.RDS')
dat$text <- paste0(dat$head, ". ", dat$body)
corp <- corpus(dat, text_field = 'text', docid_field = "tid")

# Pre-processing
toks <- tokens(corp, remove_punct = TRUE, remove_symbols = TRUE) %>%
  tokens_remove(stopwords("en", "marimo"), padding = TRUE) %>%
  tokens_select("^[a-zA-Z-]+$", valuetype = "regex",
                case_insensitive = FALSE, padding = TRUE)

# Train word2vec
wdv <- textmodel_word2vec(toks, dim = 50, type = "cbow", min_count = 5,
                          verbose = TRUE)
## Training CBOW model with 50 dimensions
## ...using 16 threads for distributed computing
## ...initializing
## ...negative sampling in 10 iterations
## ......iteration 1 elapsed time: 4.99 seconds (alpha: 0.0465)
## ......iteration 2 elapsed time: 10.00 seconds (alpha: 0.0431)
## ......iteration 3 elapsed time: 14.97 seconds (alpha: 0.0396)
## ......iteration 4 elapsed time: 19.94 seconds (alpha: 0.0362)
## ......iteration 5 elapsed time: 25.11 seconds (alpha: 0.0326)
## ......iteration 6 elapsed time: 30.14 seconds (alpha: 0.0291)
## ......iteration 7 elapsed time: 35.12 seconds (alpha: 0.0257)
## ......iteration 8 elapsed time: 40.22 seconds (alpha: 0.0222)
## ......iteration 9 elapsed time: 45.15 seconds (alpha: 0.0188)
## ......iteration 10 elapsed time: 50.32 seconds (alpha: 0.0152)
## ...complete
```
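The package also supports latent semantic analysis as an alternative way to obtain word vectors. A minimal sketch, assuming the trainer is called `textmodel_lsa()` and accepts the same tokens object and arguments as above (check the package documentation for the exact name and signature):

```r
# Hypothetical LSA counterpart to the word2vec model above
lsa <- textmodel_lsa(toks, dim = 50, min_count = 5)
head(similarity(lsa, c("amazon", "obama"), mode = "word"))
```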
## Similarity between word vectors

`similarity()` computes cosine similarity between word vectors.
```r
head(similarity(wdv, c("amazon", "forests", "obama", "america", "afghanistan"),
                mode = "word"))
##      amazon        forests       obama            america          
## [1,] "amazon"      "forests"     "obama"          "america"        
## [2,] "rainforest"  "herds"       "barack"         "africa"         
## [3,] "plantations" "rainforests" "biden"          "dakota"         
## [4,] "farms"       "rainforest"  "kerry"          "american"       
## [5,] "patagonia"   "plantations" "administration" "carolina"       
## [6,] "warm-water"  "farmland"    "hagel"          "america-focused"
##      afghanistan  
## [1,] "afghanistan"
## [2,] "afghan"     
## [3,] "kabul"      
## [4,] "pakistan"   
## [5,] "taliban"    
## [6,] "afghans"
```
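The scores behind this ranking are plain cosine similarities between rows of the embedding matrix. A minimal sketch of the computation, assuming the matrix can be extracted with `as.matrix()` (see the package documentation for the actual accessor):

```r
# Recompute one similarity score by hand (assumes as.matrix() returns
# the vocabulary-by-dimension embedding matrix)
emb <- as.matrix(wdv)
cos_sim <- function(x, y) sum(x * y) / sqrt(sum(x ^ 2) * sum(y ^ 2))
cos_sim(emb["amazon", ], emb["rainforest", ])  # high, as in the ranking above
```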
## Arithmetic operations of word vectors

`analogy()` offers an interface for arithmetic operations on word vectors.
```r
# What is Amazon without forests?
head(similarity(wdv, analogy(~ amazon - forests)))
##      [,1]         
## [1,] "choo"       
## [2,] "smash-hit"  
## [3,] "yahoo"      
## [4,] "tripadvisor"
## [5,] "univision"  
## [6,] "dreamworks"
```
```r
# Who is to Afghanistan as Obama is to America?
head(similarity(wdv, analogy(~ obama - america + afghanistan)))
##      [,1]         
## [1,] "taliban"    
## [2,] "afghanistan"
## [3,] "karzai"     
## [4,] "hagel"      
## [5,] "hamid"      
## [6,] "obama"
```
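Conceptually, `analogy()` only adds and subtracts the named word vectors before the similarity ranking. A hand-rolled equivalent might look like this (a sketch, reusing the hypothetical `emb` matrix and `cos_sim()` helper from above):

```r
# obama - america + afghanistan, computed by hand (illustrative only)
v <- emb["obama", ] - emb["america", ] + emb["afghanistan", ]
sims <- apply(emb, 1, function(row) cos_sim(row, v))
head(sort(sims, decreasing = TRUE))  # should rank similar words to above
```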
These examples replicate the analogy tasks in the original word2vec paper (Mikolov et al., 2013).
```r
# What is to France as Berlin is to Germany?
head(similarity(wdv, analogy(~ berlin - germany + france)))
##      [,1]       
## [1,] "paris"    
## [2,] "berlin"   
## [3,] "bourget"  
## [4,] "brussels" 
## [5,] "amsterdam"
## [6,] "france"
```
```r
# What is to slowly as quick is to quickly?
head(similarity(wdv, analogy(~ quick - quickly + slowly)))
##      [,1]             
## [1,] "uneven"         
## [2,] "gravity-defying"
## [3,] "slow"           
## [4,] "super-charged"  
## [5,] "buck"           
## [6,] "sideways"
```
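The package is named for both word and document vectors; one common way to obtain the latter is to average the vectors of the words in each document. A minimal sketch using quanteda, reusing the hypothetical `emb` matrix from above (the package may offer a dedicated function for this, so check its documentation first):

```r
# Document vectors as averages of word vectors (illustrative only)
dfmat <- dfm(toks) %>%
  dfm_match(rownames(emb))  # align dfm features with the vocabulary
dov <- as.matrix(dfmat %*% emb) / (rowSums(dfmat) + 1e-9)
dim(dov)                    # one 50-dimensional vector per document
```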