A RetroSearch Logo

Home - News ( United States | United Kingdom | Italy | Germany ) - Football scores

Search Query:

Showing content from http://cran.rstudio.com/web/packages/Rcpp/../SparseICA/../Rcpp/../triebeard/vignettes/r_radix.html below:

Radix trees in R

A radix tree, or trie, is a data structure optimised for storing key-value pairs in a way optimised for searching. This makes them very, very good for efficiently matching data against keys, and retrieving the values associated with those keys.

triebeard provides an implementation of tries for R (and one that can be used in Rcpp development, too, if that’s your thing) so that useRs can take advantage of the fast, efficient and user-friendly matching that they allow.

Radix usage

Suppose we have observations in a dataset that are labelled, with a 2-3 letter code that identifies the facility the sample came from:

labels <- c("AO-1002", "AEO-1004", "AAI-1009", "AFT-1403", "QZ-9065", "QZ-1021", "RF-0901",
            "AO-1099", "AFT-1101", "QZ-4933")

We know the facility each code maps to, and we want to be able to map the labels to that - not over 10 entries but over hundreds, or thousands, or hundreds of thousands. Tries are a great way of doing that: we treat the codes as keys and the full facility names as values. So let’s make a trie to do this matching, and then, well, match:

library(triebeard)
trie <- trie(keys = c("AO", "AEO", "AAI", "AFT", "QZ", "RF"),
             values = c("Audobon", "Atlanta", "Ann Arbor", "Austin", "Queensland", "Raleigh"))

longest_match(trie = trie, to_match = labels)

 [1] "Audobon"    "Atlanta"    "Ann Arbor"  "Austin"     "Queensland" "Queensland" "Raleigh"    "Audobon"    "Austin"    
[10] "Queensland"

This pulls out, for each label, the trie value where the associated key has the longest prefix-match to the label. We can also just grab all the values where the key starts with, say, A:

prefix_match(trie = trie, to_match = "A")

[[1]]
[1] "Ann Arbor" "Atlanta"   "Austin"    "Audobon"  

And finally if we want we can match very, very fuzzily using “greedy” matching:

greedy_match(trie = trie, to_match = "AO")

[[1]]
[1] "Ann Arbor" "Atlanta"   "Austin"    "Audobon"  

These operations are very, very efficient. If we use longest-match as an example, since that’s the most useful thing, with a one-million element vector of things to match against:

library(triebeard)
library(microbenchmark)

trie <- trie(keys = c("AO", "AEO", "AAI", "AFT", "QZ", "RF"),
             values = c("Audobon", "Atlanta", "Ann Arbor", "Austin", "Queensland", "Raleigh"))

labels <- rep(c("AO-1002", "AEO-1004", "AAI-1009", "AFT-1403", "QZ-9065", "QZ-1021", "RF-0901",
                "AO-1099", "AFT-1101", "QZ-4933"), 100000)

microbenchmark({longest_match(trie = trie, to_match = labels)})

Unit: milliseconds
                                                  expr      min       lq     mean   median       uq      max neval
 {     longest_match(trie = trie, to_match = labels) } 284.6457 285.5902 289.5342 286.8775 288.4564 327.3878   100

I think we can call <300 milliseconds for a million matches against an entire set of possible values pretty fast.

Radix modification

There’s always the possibility that (horror of horrors) you’ll have to add or remove entries from the trie. Fear not; you can do just that with trie_add and trie_remove respectively, both of which silently modify the trie they’re provided with to add or remove whatever key-value pairs you provide:

to_match = "198.0.0.1"
trie_inst <- trie(keys = "197", values = "fake range")

longest_match(trie_inst, to_match)
[1] NA

trie_add(trie_inst, keys = "198", values = "home range")
longest_match(trie_inst, to_match)
[1] "home range"

trie_remove(trie_inst, keys = "198")
longest_match(trie_inst, to_match)
[1] NA

RetroSearch is an open source project built by @garambo | Open a GitHub Issue

Search and Browse the WWW like it's 1997 | Search results from DuckDuckGo

HTML: 3.2 | Encoding: UTF-8 | Version: 0.7.4