RetroSearch Browse

Home - News ( United States | United Kingdom | Italy | Germany ) - Football scores

Showing content from https://github.com/reconhub/matchmaker below:

reconhub/matchmaker: Dictionary-based cleaning for categorical variables

The goal of {matchmaker} is to provide dictionary-based cleaning for R users in a simple and intuitive manner built on the {forcats} package. Some of the features of this package include:

preservation of factor orders
ability to specify explicit and implicit missing values
option to replace by fuzzy matching (regular expressions, anchored by default)
optional variable selection by fuzzy matching

You can install {matchmaker} from CRAN:

install.packages("matchmaker")

The matchmaker package has two user-facing functions that perform dictionary-based cleaning:

match_vec() will translate the values in a single vector
match_df() will translate values in all specified columns of a data frame

Each of these functions have four manditory options:

x: your data. This will be a vector or data frame depending on the function.
dictionary: This is a data frame with at least two columns specifying keys and values to modify
from: a character or number specifying which column contains the keys
to: a character or number specifying which column contains the values

Mostly, users will be working with match_df() to transform values across specific columns. A typical workflow would be to:

construct your dictionary in a spreadsheet program based on your data
read in your data and dictionary to data frames in R
match!

library("matchmaker")

# Read in data set
dat <- read.csv(matchmaker_example("coded-data.csv"),
  stringsAsFactors = FALSE
)
dat$date <- as.Date(dat$date)

# Read in dictionary
dict <- read.csv(matchmaker_example("spelling-dictionary.csv"),
  stringsAsFactors = FALSE
)

This is the top of our data set, generated for example purposes

id date readmission treated facility age_group lab_result_01 lab_result_02 lab_result_03 has_symptoms followup ef267c 2019-07-08 NA 0 C 10 unk high inc NA u e80a37 2019-07-07 y 0 3 10 inc unk norm y oui b72883 2019-07-07 y 1 8 30 inc norm inc oui c9ee86 2019-07-09 n 1 4 40 inc inc unk y oui 40bc7a 2019-07-12 n 1 6 0 norm unk norm NA n 46566e 2019-07-14 y NA B 50 unk unk inc NA NA

The dictionary looks like this:

options values grp orders y Yes readmission 1 n No readmission 2 u Unknown readmission 3 .missing Missing readmission 4 0 Yes treated 1 1 No treated 2 .missing Missing treated 3 1 Facility 1 facility 1 2 Facility 2 facility 2 3 Facility 3 facility 3 4 Facility 4 facility 4 5 Facility 5 facility 5 6 Facility 6 facility 6 7 Facility 7 facility 7 8 Facility 8 facility 8 9 Facility 9 facility 9 10 Facility 10 facility 10 .default Unknown facility 11 0 0-9 age_group 1 10 10-19 age_group 2 20 20-29 age_group 3 30 30-39 age_group 4 40 40-49 age_group 5 50 50+ age_group 6 high High .regex ^lab_result_ 1 norm Normal .regex ^lab_result_ 2 inc Inconclusive .regex ^lab_result_ 3 y yes .global Inf n no .global Inf u unknown .global Inf unk unknown .global Inf oui yes .global Inf .missing missing .global Inf

# Clean spelling based on dictionary -----------------------------
cleaned <- match_df(dat,
  dictionary = dict,
  from = "options",
  to = "values",
  by = "grp"
)
head(cleaned)
#>       id       date readmission treated    facility age_group
#> 1 ef267c 2019-07-08     Missing     Yes     Unknown     10-19
#> 2 e80a37 2019-07-07         Yes     Yes Facility  3     10-19
#> 3 b72883 2019-07-07         Yes      No Facility  8     30-39
#> 4 c9ee86 2019-07-09          No      No Facility  4     40-49
#> 5 40bc7a 2019-07-12          No      No Facility  6       0-9
#> 6 46566e 2019-07-14         Yes Missing     Unknown       50+
#>   lab_result_01 lab_result_02 lab_result_03 has_symptoms followup
#> 1       unknown          High  Inconclusive      missing  unknown
#> 2  Inconclusive       unknown        Normal          yes      yes
#> 3  Inconclusive        Normal  Inconclusive      missing      yes
#> 4  Inconclusive  Inconclusive       unknown          yes      yes
#> 5        Normal       unknown        Normal      missing       no
#> 6       unknown       unknown  Inconclusive      missing  missing

RetroSearch is an open source project built by @garambo | Open a GitHub Issue

Search and Browse the WWW like it's 1997 | Search results from DuckDuckGo

HTML: 3.2 | Encoding: UTF-8 | Version: 0.7.4