User-friendly functions for extracting a data table (row for each match, column for each group) from non-tabular text data using regular expressions, and for melting/reshaping columns that match a regular expression. Please read and cite my related R Journal papers, if you use this code!
fruit.vec <- c("granny smith apple", "blood orange and yellow banana")
fruit.pattern <- list(type=".*?", " ", fruit="orange|apple|banana")
nc::capture_first_vec(fruit.vec, fruit.pattern)
#>            type  fruit
#> 1: granny smith  apple
#> 2:        blood orange
nc::capture_all_str(fruit.vec, fruit.pattern)
#>            type  fruit
#> 1: granny smith  apple
#> 2:        blood orange
#> 3:   and yellow banana

Quick demo of reshaping functions
(one.iris <- iris[1,])
#>   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 1          5.1         3.5          1.4         0.2  setosa
nc::capture_melt_single(one.iris, part=".*", "[.]", dim=".*")
#>    Species  part    dim value
#> 1:  setosa Sepal Length   5.1
#> 2:  setosa Sepal  Width   3.5
#> 3:  setosa Petal Length   1.4
#> 4:  setosa Petal  Width   0.2
nc::capture_melt_multiple(one.iris, part=".*", "[.]", column=".*")
#>    Species  part Length Width
#> 1:  setosa Petal    1.4   0.2
#> 2:  setosa Sepal    5.1   3.5
nc::capture_melt_multiple(one.iris, column=".*", "[.]", dim=".*")
#>    Species    dim Petal Sepal
#> 1:  setosa Length   1.4   5.1
#> 2:  setosa  Width   0.2   3.5
install.packages("nc")
## or:
if(!require(devtools))install.packages("devtools")
devtools::install_github("tdhock/nc")
Watch the screencast tutorial videos!
The main functions provided in nc are:
| Subject              | nc function           | Similar to                          | And                   |
|----------------------|-----------------------|-------------------------------------|-----------------------|
| Single string        | capture_all_str       | stringr::str_match_all              | rex::re_matches       |
| Character vector     | capture_first_vec     | stringr::str_match                  | rex::re_matches       |
| Data frame chr cols  | capture_first_df      | tidyr::extract/separate_wider_regex | data.table::tstrsplit |
| Data frame col names | capture_melt_single   | tidyr::pivot_longer                 | data.table::melt      |
| Data frame col names | capture_melt_multiple | tidyr::pivot_longer                 | data.table::melt      |
| File paths           | capture_first_glob    | arrow::open_dataset                 |                       |
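
The "Data frame chr cols" row is the only one in the table without a demo elsewhere in this README, so here is a minimal sketch of capture_first_df. The data frame subject.df and its column contents are made up for illustration; each named argument is the name of a character column, and its value is a list defining the pattern to match against that column.

```r
# Hypothetical input: two character columns with structured text.
subject.df <- data.frame(
  JobID=c("13937810_25", "14022192_1"),
  position=c("chr10:213054000-213055000", "chrM:111000-222000"),
  stringsAsFactors=FALSE)

# One pattern per column; as.integer converts the preceding named group.
match.dt <- nc::capture_first_df(
  subject.df,
  JobID=list(
    job="[0-9]+", as.integer,
    "_",
    task="[0-9]+", as.integer),
  position=list(
    chrom="chr.*?",
    ":",
    chromStart="[0-9]+", as.integer,
    "-",
    chromEnd="[0-9]+", as.integer))
print(match.dt)
```

The result keeps the original columns and adds one column per named capture group (job, task, chrom, chromStart, chromEnd).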
- capture_first_vec and capture_first_df, which capture the first match in each of several subjects (character vector, data frame character columns).
- capture_all_str, which captures all matches in a single string, or a single multi-line text file. The vignette also shows how to use capture_all_str on several different strings/files, using data.table by syntax.
- capture_melt_single and capture_melt_multiple, which match a regex to the column names of a wide data frame, then melt/reshape the matching columns. These functions are especially useful when more than one separate piece of information can be captured from each column name, e.g. the iris column names Petal.Width, Sepal.Width, etc. each have two pieces of information (flower part and measurement dimension).
- nc::capture_first_glob.

By default, nc uses PCRE. Other options include ICU and RE2.
To tell nc that you would like to use a certain engine, set the nc.engine option, e.g. options(nc.engine="RE2").
Every function also has an engine argument, e.g.
nc::capture_first_vec(
  "foo a\U0001F60E# bar",
  before=".*?",
  emoji="\\p{EMOJI_Presentation}",
  after=".*",
  engine="ICU")
#>   before emoji after
#> 1  foo a    😎 # bar
For a detailed comparison of regex C libraries in R (ICU, PCRE, TRE, RE2), see my R Journal (2019) paper about namedCapture.
The nc reshaping functions provide functionality similar to packages tidyr, stats, data.table, reshape, reshape2, cdata, utils, etc. The main difference is that nc::capture_melt_*
support named capture regular expressions with type conversion, which (1) makes it easier to create/maintain a complex regex, and (2) results in less repetition in user code. For a detailed comparison, see my R Journal (2021) paper about nc.
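
As a small illustration of named capture with type conversion during reshaping, here is a sketch that melts two made-up columns, day1 and day2, and converts the captured day number to integer in the same call. The data frame wide and its column names are assumptions for this example only.

```r
# Hypothetical wide data: one measurement column per day.
wide <- data.frame(
  id=1:2,
  day1=c(5.1, 4.9),
  day2=c(3.5, 3.0))

# "day" is matched literally; the digits are captured in a group named
# day, and as.integer converts that group in the output column.
tall <- nc::capture_melt_single(
  wide,
  "day",
  day="[0-9]+", as.integer)
print(tall)
```

The regex, the output column name, and the conversion function are all written once, next to each other, which is the reduction in repetition described above.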
Below I list the main differences between the functions in nc and other analogous R functions:

- nc functions all have the capture_ prefix for easy auto-completion.
- nc output is always a data.table (other packages output either a list, character matrix, or data frame).
- nc::capture_first_df modifies the input if it is a data table, whereas tidyr functions always copy the input table.
- nc::capture_first_vec stops with an error if any subjects do not match, whereas other functions return NA/missing rows.
- nc::capture_all_str only supports capturing multiple matches in a single subject (returning a data table), whereas other functions support multiple subjects (and return a list of character matrices). For handling multiple subjects using nc, use DT[, nc::capture_all_str(subject), by] (see vignette 2 for more info).
- nc::capture_melt_single and nc::capture_melt_multiple use regex for wide-to-tall data reshaping; see Vignette 3 and my R Journal (2021) paper for more info. Whereas in nc these are two separate functions, other packages typically provide a single function which does both kinds of reshaping, for example measure in data.table.
- nc::capture_first_glob is for reading any kind of regularly named files into R using regex, whereas arrow::open_dataset requires a particular naming scheme (does not support regex).
- nc::measure can be used to create the measure.vars argument of data.table::melt, and nc::capture_longer_spec can be used to create the spec argument of tidyr::pivot_longer. This can be useful if you want to use nc to define the regex, but you want to use the other package functions to do the reshape.
- nc::field is provided for defining patterns that match subjects like variable=value, and create a column/group named variable (useful to avoid repeating variable names in regex code). See vignette 2 for more info.
- nc::alternatives_with_shared_groups is provided for defining a pattern containing alternatives with shared groups. See vignette 5 for more info.
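
Two items in the list above can be sketched briefly: the DT[, nc::capture_all_str(subject), by] idiom for several subjects, and nc::field for variable=value patterns. The table subject.dt, its columns, and the JobID/State strings are hypothetical data made up for this example.

```r
library(data.table)

# Hypothetical table with one subject string per id.
subject.dt <- data.table(
  id=c("a", "b"),
  subject=c("x=1 y=2", "x=3 y=4 z=5"))

# capture_all_str handles one subject at a time, so run it once per
# group using data.table by syntax.
all.dt <- subject.dt[, nc::capture_all_str(
  subject,
  variable="[a-z]",
  "=",
  value="[0-9]+", as.integer),
  by=id]
print(all.dt)

# nc::field("JobID", "=", ...) matches the literal text JobID= and then
# captures the value in a group named JobID, so the name is written once.
field.dt <- nc::capture_first_vec(
  "JobID=123 State=DONE",
  nc::field("JobID", "=", "[0-9]+", as.integer),
  " ",
  nc::field("State", "=", ".*"))
print(field.dt)
```

Here all.dt has one row per match, with the by column id alongside the captured columns, and field.dt has one column per field.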