A radix tree, or trie, is a data structure that stores key-value pairs in a way optimised for searching. This makes tries very, very good at efficiently matching data against keys and retrieving the values associated with those keys.
triebeard provides an implementation of tries for R (and one that can be used in Rcpp development, too, if that's your thing) so that useRs can take advantage of the fast, efficient and user-friendly matching that they allow.
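To make the idea of a longest prefix match concrete before diving into the package, here is a minimal base-R sketch (purely illustrative; naive_longest_match is a made-up helper, not part of triebeard) that, for each string, scans every key and keeps the value of the longest key that prefixes it. A trie produces the same answer without having to scan the whole key set:
# Illustrative only: check every key against every string
naive_longest_match <- function(keys, values, to_match) {
  vapply(to_match, function(x) {
    hits <- keys[startsWith(x, keys)]            # keys that are a prefix of x
    if (length(hits) == 0) return(NA_character_)
    values[keys == hits[which.max(nchar(hits))]] # value of the longest such key
  }, character(1), USE.NAMES = FALSE)
}
naive_longest_match(c("AO", "AEO"), c("Audobon", "Atlanta"), "AEO-1004")
[1] "Atlanta"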
Suppose we have observations in a dataset that are labelled with a 2-3 letter code identifying the facility each sample came from:
labels <- c("AO-1002", "AEO-1004", "AAI-1009", "AFT-1403", "QZ-9065", "QZ-1021", "RF-0901",
"AO-1099", "AFT-1101", "QZ-4933")
We know the facility each code maps to, and we want to be able to map the labels to that - not over 10 entries but over hundreds, or thousands, or hundreds of thousands. Tries are a great way of doing that: we treat the codes as keys and the full facility names as values. So let's make a trie to do this matching, and then, well, match:
library(triebeard)
trie <- trie(keys = c("AO", "AEO", "AAI", "AFT", "QZ", "RF"),
             values = c("Audobon", "Atlanta", "Ann Arbor", "Austin", "Queensland", "Raleigh"))
longest_match(trie = trie, to_match = labels)
[1] "Audobon" "Atlanta" "Ann Arbor" "Austin" "Queensland" "Queensland" "Raleigh" "Audobon" "Austin"
[10] "Queensland"
This pulls out, for each label, the trie value where the associated key has the longest prefix-match to the label. We can also just grab all the values where the key starts with, say, A:
prefix_match(trie = trie, to_match = "A")
[[1]]
[1] "Ann Arbor" "Atlanta" "Austin" "Audobon"
And finally, if we want, we can match very, very fuzzily using "greedy" matching:
greedy_match(trie = trie, to_match = "AO")
[[1]]
[1] "Ann Arbor" "Atlanta" "Austin" "Audobon"
These operations are very, very efficient. Taking longest-match as an example, since that's the most useful one, here it is run over a one-million-element vector of labels to match against:
library(triebeard)
library(microbenchmark)
trie <- trie(keys = c("AO", "AEO", "AAI", "AFT", "QZ", "RF"),
             values = c("Audobon", "Atlanta", "Ann Arbor", "Austin", "Queensland", "Raleigh"))
labels <- rep(c("AO-1002", "AEO-1004", "AAI-1009", "AFT-1403", "QZ-9065", "QZ-1021", "RF-0901",
                "AO-1099", "AFT-1101", "QZ-4933"), 100000)
microbenchmark({longest_match(trie = trie, to_match = labels)})
Unit: milliseconds
expr min lq mean median uq max neval
{ longest_match(trie = trie, to_match = labels) } 284.6457 285.5902 289.5342 286.8775 288.4564 327.3878 100
I think we can call under 300 milliseconds for a million matches against the full set of keys pretty fast.
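For context, one way to see where the time goes is to benchmark longest_match against the naive linear-scan sketch from earlier (naive_longest_match, the illustrative helper, not part of triebeard): the scan loops over every key for every label in R, while the trie does its walking in compiled code and only follows the matched prefix. Actual timings will vary by machine, so this is just a sketch of how one might compare them:
keys   <- c("AO", "AEO", "AAI", "AFT", "QZ", "RF")
values <- c("Audobon", "Atlanta", "Ann Arbor", "Austin", "Queensland", "Raleigh")
microbenchmark(
  trie  = longest_match(trie = trie, to_match = labels),
  naive = naive_longest_match(keys, values, labels),
  times = 10
)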
Radix modification
There's always the possibility that (horror of horrors) you'll have to add or remove entries from the trie. Fear not; you can do just that with trie_add and trie_remove respectively, both of which silently modify the trie they're provided with to add or remove whatever key-value pairs you provide:
to_match = "198.0.0.1"
trie_inst <- trie(keys = "197", values = "fake range")
longest_match(trie_inst, to_match)
[1] NA
trie_add(trie_inst, keys = "198", values = "home range")
longest_match(trie_inst, to_match)
[1] "home range"
trie_remove(trie_inst, keys = "198")
longest_match(trie_inst, to_match)
[1] NA
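Both functions accept vectors of keys (and, for trie_add, matching values), so - as a quick sketch, reusing the facility trie from earlier and making up a couple of new codes - several entries can be added or dropped in one call:
trie_add(trie, keys = c("BOS", "SEA"), values = c("Boston", "Seattle"))
longest_match(trie, c("BOS-0001", "SEA-0420"))   # should now match the new values
trie_remove(trie, keys = c("BOS", "SEA"))        # and this puts the trie back as it was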