Showing content from https://github.com/reconhub/matchmaker below:

reconhub/matchmaker: Dictionary-based cleaning for categorical variables

The goal of {matchmaker} is to provide dictionary-based cleaning for R users in a simple and intuitive manner, built on the {forcats} package.

You can install {matchmaker} from CRAN:

install.packages("matchmaker")

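The matchmaker package has two user-facing functions that perform dictionary-based cleaning: match_vec(), which cleans a single vector, and match_df(), which cleans columns of a data frame.

Each of these functions has four mandatory options: the data to clean (x), the dictionary data frame (dictionary), and from and to, which name the dictionary columns holding the keys and their replacement values.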

Mostly, users will be working with match_df() to transform values across specific columns. A typical workflow would be to:

  1. construct your dictionary in a spreadsheet program based on your data
  2. read in your data and dictionary to data frames in R
  3. match!

library("matchmaker")

# Read in data set
dat <- read.csv(matchmaker_example("coded-data.csv"),
  stringsAsFactors = FALSE
)
dat$date <- as.Date(dat$date)

# Read in dictionary
dict <- read.csv(matchmaker_example("spelling-dictionary.csv"),
  stringsAsFactors = FALSE
)

This is the top of our data set, generated for example purposes:

| id | date | readmission | treated | facility | age_group | lab_result_01 | lab_result_02 | lab_result_03 | has_symptoms | followup |
|---|---|---|---|---|---|---|---|---|---|---|
| ef267c | 2019-07-08 | NA | 0 | C | 10 | unk | high | inc | NA | u |
| e80a37 | 2019-07-07 | y | 0 | 3 | 10 | inc | unk | norm | y | oui |
| b72883 | 2019-07-07 | y | 1 | 8 | 30 | inc | norm | inc | | oui |
| c9ee86 | 2019-07-09 | n | 1 | 4 | 40 | inc | inc | unk | y | oui |
| 40bc7a | 2019-07-12 | n | 1 | 6 | 0 | norm | unk | norm | NA | n |
| 46566e | 2019-07-14 | y | NA | B | 50 | unk | unk | inc | NA | NA |

The dictionary looks like this:

| options | values | grp | orders |
|---|---|---|---|
| y | Yes | readmission | 1 |
| n | No | readmission | 2 |
| u | Unknown | readmission | 3 |
| .missing | Missing | readmission | 4 |
| 0 | Yes | treated | 1 |
| 1 | No | treated | 2 |
| .missing | Missing | treated | 3 |
| 1 | Facility 1 | facility | 1 |
| 2 | Facility 2 | facility | 2 |
| 3 | Facility 3 | facility | 3 |
| 4 | Facility 4 | facility | 4 |
| 5 | Facility 5 | facility | 5 |
| 6 | Facility 6 | facility | 6 |
| 7 | Facility 7 | facility | 7 |
| 8 | Facility 8 | facility | 8 |
| 9 | Facility 9 | facility | 9 |
| 10 | Facility 10 | facility | 10 |
| .default | Unknown | facility | 11 |
| 0 | 0-9 | age_group | 1 |
| 10 | 10-19 | age_group | 2 |
| 20 | 20-29 | age_group | 3 |
| 30 | 30-39 | age_group | 4 |
| 40 | 40-49 | age_group | 5 |
| 50 | 50+ | age_group | 6 |
| high | High | .regex ^lab_result_ | 1 |
| norm | Normal | .regex ^lab_result_ | 2 |
| inc | Inconclusive | .regex ^lab_result_ | 3 |
| y | yes | .global | Inf |
| n | no | .global | Inf |
| u | unknown | .global | Inf |
| unk | unknown | .global | Inf |
| oui | yes | .global | Inf |
| .missing | missing | .global | Inf |

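The grp column mixes plain column names with two keyword forms, `.regex <pattern>` and `.global`. Judging from the cleaned output in this README, the `.regex` rows seem to apply to every column whose name matches the pattern, and the `.global` rows to all columns. The base-R sketch below only illustrates which columns the `^lab_result_` pattern would select; it is not matchmaker's internal code, and `cols`, `pattern`, and `lab_cols` are names made up for this illustration.

```r
# Column names of the example data set (typed in by hand for this sketch)
cols <- c("id", "date", "readmission", "treated", "facility", "age_group",
          "lab_result_01", "lab_result_02", "lab_result_03",
          "has_symptoms", "followup")

# Pattern copied from the `.regex ^lab_result_` entries in the grp column
pattern <- "^lab_result_"

# Keep only the column names that match the pattern
lab_cols <- grep(pattern, cols, value = TRUE)
lab_cols
```

Under that reading, the three lab_result columns share the high/norm/inc entries, so the dictionary does not need one set of rows per column.
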
# Clean spelling based on dictionary -----------------------------
cleaned <- match_df(dat,
  dictionary = dict,
  from = "options",
  to = "values",
  by = "grp"
)
head(cleaned)
#>       id       date readmission treated    facility age_group
#> 1 ef267c 2019-07-08     Missing     Yes     Unknown     10-19
#> 2 e80a37 2019-07-07         Yes     Yes Facility  3     10-19
#> 3 b72883 2019-07-07         Yes      No Facility  8     30-39
#> 4 c9ee86 2019-07-09          No      No Facility  4     40-49
#> 5 40bc7a 2019-07-12          No      No Facility  6       0-9
#> 6 46566e 2019-07-14         Yes Missing     Unknown       50+
#>   lab_result_01 lab_result_02 lab_result_03 has_symptoms followup
#> 1       unknown          High  Inconclusive      missing  unknown
#> 2  Inconclusive       unknown        Normal          yes      yes
#> 3  Inconclusive        Normal  Inconclusive      missing      yes
#> 4  Inconclusive  Inconclusive       unknown          yes      yes
#> 5        Normal       unknown        Normal      missing       no
#> 6       unknown       unknown  Inconclusive      missing  missing
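
To make the matching step concrete, here is a minimal base-R sketch of what a dictionary lookup amounts to for a single column. It is an illustration only, not how matchmaker is implemented: it rebuilds the readmission column from the four readmission rows of the dictionary, and it assumes that the `.missing` key stands in for NA values. The names `dict_readmission`, `lookup`, `keys`, and `readmission_clean` are hypothetical.

```r
# The four `readmission` rows of the dictionary, typed in by hand
dict_readmission <- data.frame(
  options = c("y", "n", "u", ".missing"),
  values  = c("Yes", "No", "Unknown", "Missing"),
  stringsAsFactors = FALSE
)

# Raw `readmission` values from the six example rows
readmission <- c(NA, "y", "y", "n", "n", "y")

# Named vector: names are the keys, elements are the replacements
lookup <- setNames(dict_readmission$values, dict_readmission$options)

# Route NA to the `.missing` key, then look every value up
keys <- ifelse(is.na(readmission), ".missing", readmission)
readmission_clean <- unname(lookup[keys])

readmission_clean
```

The result is Missing, Yes, Yes, No, No, Yes, which agrees with the readmission column printed by head(cleaned). match_df() does the equivalent for each column named in grp, as well as handling the `.default`, `.regex`, and `.global` keywords.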
