poissonconsulting/newdata: An R Package to Generate New Data Frames for Prediction

newdata

newdata is an R package to generate new data frames by varying some variables while holding the others constant.

By default, all specified variables vary across their range while all other variables are held constant at a reference value. The user can specify the length of each sequence, require that only observed values and combinations are used and add new variables. Types, classes, factor levels and time zones are always preserved.

Consider the following observed ‘old’ data frame.

library(newdata)

newdata::old_data
#> # A tibble: 3 × 9
#>   lgl     int   dbl chr      fct     ord   dte        dtt                 hms   
#>   <lgl> <int> <dbl> <chr>    <fct>   <ord> <date>     <dttm>              <time>
#> 1 TRUE      1   1   most     most    most  1970-01-02 1969-12-31 16:00:01 00'01"
#> 2 FALSE     4   4.5 most     most    most  1970-01-05 1969-12-31 16:00:04 00'04"
#> 3 NA        6   8.2 a rarity a rari… a ra… 1970-01-07 1969-12-31 16:00:06 00'06"

By default all variables are set to a reference value.

xnew_data(old_data)
#> # A tibble: 1 × 9
#>   lgl     int   dbl chr   fct     ord      dte        dtt                 hms   
#>   <lgl> <int> <dbl> <chr> <fct>   <ord>    <date>     <dttm>              <time>
#> 1 FALSE     3  4.57 most  not obs a rarity 1970-01-04 1969-12-31 16:00:03 00'03"

By default, the reference value depends on the class of the variable.
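
As a rough check on the reference row above, the integer and double values can be reproduced in base R from the three observed rows. This is only a sketch that is consistent with the printed output, assuming a floored mean for integers and a plain mean for doubles; it is not taken from the package source.

```r
# Observed values copied from old_data above.
int <- c(1L, 4L, 6L)
dbl <- c(1, 4.5, 8.2)

# Assumption: integer reference = floored mean, double reference = mean.
ref_int <- as.integer(floor(mean(int)))
ref_dbl <- mean(dbl)

ref_int            # 3
round(ref_dbl, 2)  # 4.57
```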

Specifying a variable causes it to vary sequentially across its range.

xnew_data(old_data, int)
#> # A tibble: 6 × 9
#>   lgl     int   dbl chr   fct     ord      dte        dtt                 hms   
#>   <lgl> <int> <dbl> <chr> <fct>   <ord>    <date>     <dttm>              <time>
#> 1 FALSE     1  4.57 most  not obs a rarity 1970-01-04 1969-12-31 16:00:03 00'03"
#> 2 FALSE     2  4.57 most  not obs a rarity 1970-01-04 1969-12-31 16:00:03 00'03"
#> 3 FALSE     3  4.57 most  not obs a rarity 1970-01-04 1969-12-31 16:00:03 00'03"
#> 4 FALSE     4  4.57 most  not obs a rarity 1970-01-04 1969-12-31 16:00:03 00'03"
#> 5 FALSE     5  4.57 most  not obs a rarity 1970-01-04 1969-12-31 16:00:03 00'03"
#> 6 FALSE     6  4.57 most  not obs a rarity 1970-01-04 1969-12-31 16:00:03 00'03"

By default, the sequence depends on the class of the variable.

These values can be overridden by setting options such as new_data.length_out_int.

Note that the length of Date, POSIXct and hms sequences is controlled by new_data.length_out_int, as they are treated as integers for the purpose of generating a sequence.
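
A minimal sketch of setting this option for a session with base R's options(). Only the option name new_data.length_out_int comes from the text above; the value 10L is arbitrary, and it is an assumption that the package reads the option at the time the sequence is generated.

```r
# Set the option, keeping the previous value so it can be restored later.
old_opts <- options(new_data.length_out_int = 10L)

# Confirm what is currently set.
current <- getOption("new_data.length_out_int")
current

# Restore the previous setting when done.
options(old_opts)
```

Restoring the previous value keeps the change from leaking into unrelated code that runs later in the same session.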

When programming it is strongly recommended that the user explicitly specify the length of each sequence individually.

xnew_data(old_data, lgl, xnew_seq(int, length_out = 3))
#> # A tibble: 6 × 9
#>   lgl     int   dbl chr   fct     ord      dte        dtt                 hms   
#>   <lgl> <int> <dbl> <chr> <fct>   <ord>    <date>     <dttm>              <time>
#> 1 FALSE     1  4.57 most  not obs a rarity 1970-01-04 1969-12-31 16:00:03 00'03"
#> 2 FALSE     3  4.57 most  not obs a rarity 1970-01-04 1969-12-31 16:00:03 00'03"
#> 3 FALSE     6  4.57 most  not obs a rarity 1970-01-04 1969-12-31 16:00:03 00'03"
#> 4 TRUE      1  4.57 most  not obs a rarity 1970-01-04 1969-12-31 16:00:03 00'03"
#> 5 TRUE      3  4.57 most  not obs a rarity 1970-01-04 1969-12-31 16:00:03 00'03"
#> 6 TRUE      6  4.57 most  not obs a rarity 1970-01-04 1969-12-31 16:00:03 00'03"
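
The int values 1, 3 and 6 in this output are consistent with evenly spaced points across the observed range that are then floored to integers. The helper int_seq() below is hypothetical and written in base R purely to illustrate that arithmetic; it is an assumption, not the package's implementation.

```r
# Observed int values from old_data.
int <- c(1L, 4L, 6L)

# Hypothetical helper: evenly spaced points across the observed range,
# floored to integers and de-duplicated.
int_seq <- function(x, length_out) {
  pts <- seq(min(x), max(x), length.out = length_out)
  unique(as.integer(floor(pts)))
}

int_seq(int, 3)  # 1 3 6
int_seq(int, 5)  # 1 2 3 4 6
```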

A third alternative is to specify the length of all the sequences in the data set, but this can result in less common character strings or later factor or ordered levels being dropped.

xnew_data(old_data, dbl, int, .length_out = 2)
#> # A tibble: 4 × 9
#>   lgl     int   dbl chr   fct     ord      dte        dtt                 hms   
#>   <lgl> <int> <dbl> <chr> <fct>   <ord>    <date>     <dttm>              <time>
#> 1 FALSE     1   1   most  not obs a rarity 1970-01-04 1969-12-31 16:00:03 00'03"
#> 2 FALSE     6   1   most  not obs a rarity 1970-01-04 1969-12-31 16:00:03 00'03"
#> 3 FALSE     1   8.2 most  not obs a rarity 1970-01-04 1969-12-31 16:00:03 00'03"
#> 4 FALSE     6   8.2 most  not obs a rarity 1970-01-04 1969-12-31 16:00:03 00'03"

The user can also indicate whether only observed values should be used in the sequence.

xnew_data(old_data, xnew_seq(int, length_out = 3, obs_only = TRUE))
#> # A tibble: 3 × 9
#>   lgl     int   dbl chr   fct     ord      dte        dtt                 hms   
#>   <lgl> <int> <dbl> <chr> <fct>   <ord>    <date>     <dttm>              <time>
#> 1 FALSE     1  4.57 most  not obs a rarity 1970-01-04 1969-12-31 16:00:03 00'03"
#> 2 FALSE     4  4.57 most  not obs a rarity 1970-01-04 1969-12-31 16:00:03 00'03"
#> 3 FALSE     6  4.57 most  not obs a rarity 1970-01-04 1969-12-31 16:00:03 00'03"

The xobs_only() function can be used to filter out unobserved values after the sequence has been generated.

xnew_data(old_data, xobs_only(xnew_seq(int, length_out = 3)))
#> # A tibble: 2 × 9
#>   lgl     int   dbl chr   fct     ord      dte        dtt                 hms   
#>   <lgl> <int> <dbl> <chr> <fct>   <ord>    <date>     <dttm>              <time>
#> 1 FALSE     1  4.57 most  not obs a rarity 1970-01-04 1969-12-31 16:00:03 00'03"
#> 2 FALSE     6  4.57 most  not obs a rarity 1970-01-04 1969-12-31 16:00:03 00'03"
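
The two-row result follows from filtering after generation: the length-3 sequence is 1, 3, 6, and 3 never occurs in old_data. With obs_only = TRUE inside xnew_seq(), by contrast, the sequence shown earlier was 1, 4, 6. A base R sketch of the filtering step, with the vectors copied from the outputs above:

```r
observed  <- c(1L, 4L, 6L)  # int column of old_data
generated <- c(1L, 3L, 6L)  # length-3 sequence shown earlier

# Keep only generated values that also occur in the observed data.
kept <- generated[generated %in% observed]
kept  # 1 6
```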

When two or more variables are specified, all combinations are used.

xnew_data(old_data, int, fct)
#> # A tibble: 18 × 9
#>    lgl     int   dbl chr   fct      ord    dte        dtt                 hms   
#>    <lgl> <int> <dbl> <chr> <fct>    <ord>  <date>     <dttm>              <time>
#>  1 FALSE     1  4.57 most  not obs  a rar… 1970-01-04 1969-12-31 16:00:03 00'03"
#>  2 FALSE     1  4.57 most  a rarity a rar… 1970-01-04 1969-12-31 16:00:03 00'03"
#>  3 FALSE     1  4.57 most  most     a rar… 1970-01-04 1969-12-31 16:00:03 00'03"
#>  4 FALSE     2  4.57 most  not obs  a rar… 1970-01-04 1969-12-31 16:00:03 00'03"
#>  5 FALSE     2  4.57 most  a rarity a rar… 1970-01-04 1969-12-31 16:00:03 00'03"
#>  6 FALSE     2  4.57 most  most     a rar… 1970-01-04 1969-12-31 16:00:03 00'03"
#>  7 FALSE     3  4.57 most  not obs  a rar… 1970-01-04 1969-12-31 16:00:03 00'03"
#>  8 FALSE     3  4.57 most  a rarity a rar… 1970-01-04 1969-12-31 16:00:03 00'03"
#>  9 FALSE     3  4.57 most  most     a rar… 1970-01-04 1969-12-31 16:00:03 00'03"
#> 10 FALSE     4  4.57 most  not obs  a rar… 1970-01-04 1969-12-31 16:00:03 00'03"
#> 11 FALSE     4  4.57 most  a rarity a rar… 1970-01-04 1969-12-31 16:00:03 00'03"
#> 12 FALSE     4  4.57 most  most     a rar… 1970-01-04 1969-12-31 16:00:03 00'03"
#> 13 FALSE     5  4.57 most  not obs  a rar… 1970-01-04 1969-12-31 16:00:03 00'03"
#> 14 FALSE     5  4.57 most  a rarity a rar… 1970-01-04 1969-12-31 16:00:03 00'03"
#> 15 FALSE     5  4.57 most  most     a rar… 1970-01-04 1969-12-31 16:00:03 00'03"
#> 16 FALSE     6  4.57 most  not obs  a rar… 1970-01-04 1969-12-31 16:00:03 00'03"
#> 17 FALSE     6  4.57 most  a rarity a rar… 1970-01-04 1969-12-31 16:00:03 00'03"
#> 18 FALSE     6  4.57 most  most     a rar… 1970-01-04 1969-12-31 16:00:03 00'03"
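
The 18 rows are the 6 integer values crossed with the 3 factor levels. A base R sketch of the same crossing with expand.grid(), where the level order is assumed from the row order printed above:

```r
int_values <- 1:6
fct_levels <- c("not obs", "a rarity", "most")  # order assumed from the output

# expand.grid() varies its first argument fastest, so fct is listed first
# to mirror the row order in the output above.
grid <- expand.grid(
  fct = factor(fct_levels, levels = fct_levels),
  int = int_values
)

nrow(grid)  # 18
```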

Wrap the variables in xobs_only() to get only the observed combinations.

xnew_data(old_data, xobs_only(int, fct))
#> # A tibble: 3 × 9
#>   lgl     int   dbl chr   fct      ord     dte        dtt                 hms   
#>   <lgl> <int> <dbl> <chr> <fct>    <ord>   <date>     <dttm>              <time>
#> 1 FALSE     1  4.57 most  most     a rari… 1970-01-04 1969-12-31 16:00:03 00'03"
#> 2 FALSE     4  4.57 most  most     a rari… 1970-01-04 1969-12-31 16:00:03 00'03"
#> 3 FALSE     6  4.57 most  a rarity a rari… 1970-01-04 1969-12-31 16:00:03 00'03"
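
The three rows correspond to the distinct (int, fct) pairs that actually occur in old_data. A base R sketch of that idea, using a plain data frame with the two columns copied from the data shown at the top (fct is kept as character here for simplicity):

```r
old <- data.frame(
  int = c(1L, 4L, 6L),
  fct = c("most", "most", "a rarity")
)

# Distinct combinations that actually occur in the data.
obs_combos <- unique(old[c("int", "fct")])
nrow(obs_combos)  # 3
```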

Adding a new variable or modifying an existing one is simple.

xnew_data(old_data, lgl = median(lgl, na.rm = TRUE), extra = c(TRUE, FALSE))
#> # A tibble: 2 × 10
#>     lgl   int   dbl chr   fct     ord      dte        dtt                 hms   
#>   <dbl> <int> <dbl> <chr> <fct>   <ord>    <date>     <dttm>              <time>
#> 1   0.5     3  4.57 most  not obs a rarity 1970-01-04 1969-12-31 16:00:03 00'03"
#> 2   0.5     3  4.57 most  not obs a rarity 1970-01-04 1969-12-31 16:00:03 00'03"
#> # ℹ 1 more variable: extra <lgl>
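
Note that lgl is now a double column holding 0.5. That follows from base R: the median of the two non-missing logical values is their average. A small check with the lgl values from old_data:

```r
lgl <- c(TRUE, FALSE, NA)

# With NA removed, the median of TRUE and FALSE is the mean of 1 and 0.
new_lgl <- median(lgl, na.rm = TRUE)

new_lgl         # 0.5
class(new_lgl)  # "numeric"
```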

Casting variables to be the same class as the original is achieved as follows.

xnew_data(old_data, xcast(lgl = 1, int = 7, dbl = 10L, fct = "a rarity", hms = "00:00:02"))
#> # A tibble: 1 × 9
#>   lgl     int   dbl chr   fct      ord     dte        dtt                 hms   
#>   <lgl> <int> <dbl> <chr> <fct>    <ord>   <date>     <dttm>              <time>
#> 1 TRUE      7    10 most  a rarity a rari… 1970-01-04 1969-12-31 16:00:03 00'02"
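
For intuition, the conversions in this call resemble the following base R coercions. This is only an analogy under stated assumptions: it is not how xcast() is implemented, the factor level order is assumed from the earlier output, and the hms column is left out because it needs the hms package.

```r
# Stand-in for the original factor column (level order is an assumption).
old_fct <- factor(c("most", "most", "a rarity"),
                  levels = c("not obs", "a rarity", "most"))

new_lgl <- as.logical(1)    # double 1   -> TRUE
new_int <- as.integer(7)    # double 7   -> 7L
new_dbl <- as.numeric(10L)  # integer 10 -> 10

# Reusing the original levels keeps the new factor compatible with the old one.
new_fct <- factor("a rarity", levels = levels(old_fct))
levels(new_fct)
```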

Although superseded, new_data() is still provided for consistency with existing code. It is a simple wrapper on xnew_data() that allows the user to pass a character vector of variable names and to specify the length of all the sequences.

new_data(old_data, seq = c("int", "fct"), length_out = 5)
#> # A tibble: 15 × 9
#>    lgl     int   dbl chr   fct      ord    dte        dtt                 hms   
#>    <lgl> <int> <dbl> <chr> <fct>    <ord>  <date>     <dttm>              <time>
#>  1 FALSE     1  4.57 most  not obs  a rar… 1970-01-04 1969-12-31 16:00:03 00'03"
#>  2 FALSE     1  4.57 most  a rarity a rar… 1970-01-04 1969-12-31 16:00:03 00'03"
#>  3 FALSE     1  4.57 most  most     a rar… 1970-01-04 1969-12-31 16:00:03 00'03"
#>  4 FALSE     2  4.57 most  not obs  a rar… 1970-01-04 1969-12-31 16:00:03 00'03"
#>  5 FALSE     2  4.57 most  a rarity a rar… 1970-01-04 1969-12-31 16:00:03 00'03"
#>  6 FALSE     2  4.57 most  most     a rar… 1970-01-04 1969-12-31 16:00:03 00'03"
#>  7 FALSE     3  4.57 most  not obs  a rar… 1970-01-04 1969-12-31 16:00:03 00'03"
#>  8 FALSE     3  4.57 most  a rarity a rar… 1970-01-04 1969-12-31 16:00:03 00'03"
#>  9 FALSE     3  4.57 most  most     a rar… 1970-01-04 1969-12-31 16:00:03 00'03"
#> 10 FALSE     4  4.57 most  not obs  a rar… 1970-01-04 1969-12-31 16:00:03 00'03"
#> 11 FALSE     4  4.57 most  a rarity a rar… 1970-01-04 1969-12-31 16:00:03 00'03"
#> 12 FALSE     4  4.57 most  most     a rar… 1970-01-04 1969-12-31 16:00:03 00'03"
#> 13 FALSE     6  4.57 most  not obs  a rar… 1970-01-04 1969-12-31 16:00:03 00'03"
#> 14 FALSE     6  4.57 most  a rarity a rar… 1970-01-04 1969-12-31 16:00:03 00'03"
#> 15 FALSE     6  4.57 most  most     a rar… 1970-01-04 1969-12-31 16:00:03 00'03"

To install the latest release version from CRAN.

install.packages("newdata")

To install the latest development version from GitHub

# install.packages("pak")
pak::pak("poissonconsulting/newdata")

or from r-universe.

install.packages("newdata", repos = c("https://poissonconsulting.r-universe.dev", "https://cloud.r-project.org"))

Please report any issues.

Pull requests are always welcome.

Please note that the newdata project is released with a Contributor Code of Conduct. By contributing to this project, you agree to abide by its terms.

