The goal of {matchmaker} is to provide dictionary-based cleaning for R users in a simple and intuitive manner built on the {forcats} package. Some of the features of this package include:
You can install {matchmaker} from CRAN:
install.packages("matchmaker")
The matchmaker package has two user-facing functions that perform dictionary-based cleaning:
match_vec() will translate the values in a single vector
match_df() will translate values in all specified columns of a data frame

Each of these functions has four mandatory options:

x: your data. This will be a vector or data frame depending on the function.
dictionary: a data frame with at least two columns specifying keys and values to modify
from: a character or number specifying which column contains the keys
to: a character or number specifying which column contains the values

Mostly, users will be working with match_df() to transform values across specific columns. A typical workflow would be to:
library("matchmaker")

# Read in data set
dat <- read.csv(matchmaker_example("coded-data.csv"),
  stringsAsFactors = FALSE
)
dat$date <- as.Date(dat$date)

# Read in dictionary
dict <- read.csv(matchmaker_example("spelling-dictionary.csv"),
  stringsAsFactors = FALSE
)
This is the top of our data set, generated for example purposes:
id      date        readmission  treated  facility  age_group  lab_result_01  lab_result_02  lab_result_03  has_symptoms  followup
ef267c  2019-07-08  NA           0        C         10         unk            high           inc            NA            u
e80a37  2019-07-07  y            0        3         10         inc            unk            norm           y             oui
b72883  2019-07-07  y            1        8         30         inc            norm           inc                          oui
c9ee86  2019-07-09  n            1        4         40         inc            inc            unk            y             oui
40bc7a  2019-07-12  n            1        6         0          norm           unk            norm           NA            n
46566e  2019-07-14  y            NA       B         50         unk            unk            inc            NA            NA

The dictionary looks like this:
options   values        grp                  orders
y         Yes           readmission          1
n         No            readmission          2
u         Unknown       readmission          3
.missing  Missing       readmission          4
0         Yes           treated              1
1         No            treated              2
.missing  Missing       treated              3
1         Facility 1    facility             1
2         Facility 2    facility             2
3         Facility 3    facility             3
4         Facility 4    facility             4
5         Facility 5    facility             5
6         Facility 6    facility             6
7         Facility 7    facility             7
8         Facility 8    facility             8
9         Facility 9    facility             9
10        Facility 10   facility             10
.default  Unknown       facility             11
0         0-9           age_group            1
10        10-19         age_group            2
20        20-29         age_group            3
30        30-39         age_group            4
40        40-49         age_group            5
50        50+           age_group            6
high      High          .regex ^lab_result_  1
norm      Normal        .regex ^lab_result_  2
inc       Inconclusive  .regex ^lab_result_  3
y         yes           .global              Inf
n         no            .global              Inf
u         unknown       .global              Inf
unk       unknown       .global              Inf
oui       yes           .global              Inf
.missing  missing       .global              Inf

# Clean spelling based on dictionary -----------------------------
cleaned <- match_df(dat,
  dictionary = dict,
  from = "options",
  to = "values",
  by = "grp"
)
head(cleaned)
#>       id       date readmission treated   facility age_group
#> 1 ef267c 2019-07-08     Missing     Yes    Unknown     10-19
#> 2 e80a37 2019-07-07         Yes     Yes Facility 3     10-19
#> 3 b72883 2019-07-07         Yes      No Facility 8     30-39
#> 4 c9ee86 2019-07-09          No      No Facility 4     40-49
#> 5 40bc7a 2019-07-12          No      No Facility 6       0-9
#> 6 46566e 2019-07-14         Yes Missing    Unknown       50+
#>   lab_result_01 lab_result_02 lab_result_03 has_symptoms followup
#> 1       unknown          High  Inconclusive      missing  unknown
#> 2  Inconclusive       unknown        Normal          yes      yes
#> 3  Inconclusive        Normal  Inconclusive      missing      yes
#> 4  Inconclusive  Inconclusive       unknown          yes      yes
#> 5        Normal       unknown        Normal      missing       no
#> 6       unknown       unknown  Inconclusive      missing  missing
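
For a single column, match_vec() takes the same four options. The following is a minimal sketch rather than an example from the package itself: the two-column dictionary yn_dict and the vector answers are made up for illustration, and the code assumes {matchmaker} is installed.

```r
library("matchmaker")

# Hypothetical dictionary: keys in `options`, replacement values in `values`
yn_dict <- data.frame(
  options = c("y", "n", "u", ".missing"),
  values  = c("Yes", "No", "Unknown", "Missing"),
  stringsAsFactors = FALSE
)

# Hypothetical input vector containing one missing value
answers <- c("y", "n", NA, "u", "y")

# Translate each key in `answers` to its value from the dictionary
recoded <- match_vec(answers,
  dictionary = yn_dict,
  from = "options",
  to = "values"
)
print(recoded)
```

As in the match_df() example above, the .missing key is meant to stand in for missing values, much as the .default key under facility maps the unlisted codes C and B to Unknown.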