This package contains R functions corresponding to useful Stata commands.
The package includes:
sum_up prints detailed summary statistics (corresponds to Stata summarize)
    library(dplyr)
    library(statar)

    N <- 100
    df <- tibble(
      id = 1:N,
      v1 = sample(5, N, TRUE),
      v2 = sample(1e6, N, TRUE)
    )
    sum_up(df)
    df %>% sum_up(starts_with("v"), d = TRUE)
    df %>% group_by(v1) %>% sum_up()
tab prints distinct rows with their count. Compared to the dplyr function count, this command adds frequency, percent, and cumulative percent.
    N <- 1e2; K <- 10
    df <- tibble(
      id = sample(c(NA, 1:5), N / K, TRUE),
      v1 = sample(1:5, N / K, TRUE)
    )
    tab(df, id)
    tab(df, id, na.rm = TRUE)
    tab(df, id, v1)
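To make the three added columns concrete, the following base R sketch computes them by hand for a single variable. It does not use statar and is not how tab is implemented; the vector `id` and the column names `Freq`, `Percent`, and `Cum` are illustrative assumptions.

```r
# Hand-rolled one-way tabulation in base R (illustration only, not statar's code).
id <- c(1, 2, 2, 3, 3, 3, NA, NA, 3, 1)

# Count each distinct value, keeping NA as its own category
freq <- table(id, useNA = "ifany")

tab_sketch <- data.frame(
  id      = names(freq),
  Freq    = as.integer(freq),
  Percent = 100 * as.integer(freq) / sum(freq)
)
# Cumulative percent, in the order the categories are listed
tab_sketch$Cum <- cumsum(tab_sketch$Percent)

print(tab_sketch)
```

The last row of `Cum` always reaches 100, which is a quick sanity check on any tabulation of this kind.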
join is a wrapper for dplyr merge functionalities, with three added options:
The option check checks that there are no duplicates in the master or using datasets (as in Stata).
    # merge m:1 v1
    join(x, y, kind = "full", check = m~1)
The option gen specifies the name of a new variable that identifies non-matched and matched rows (as in Stata).
    # merge m:1 v1, gen(_merge)
    join(x, y, kind = "full", gen = "_merge")
The option update allows missing values in the master dataset to be updated with the values from the using dataset.
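The three options can be pictured with a base R sketch built on merge(). This is only an illustration of the ideas, not statar's implementation: the data frames `master` and `using`, the helper columns `in_master` and `in_using`, and the column `merge_code` are all hypothetical, and the codes 1/2/3 follow Stata's `_merge` convention (master only, using only, matched), which statar is assumed to mirror.

```r
# Base R illustration of check, gen and update (not statar's code).
master <- data.frame(v1 = c(1, 2, 2, 3), x = c(10, NA, 21, 30))
using  <- data.frame(v1 = c(2, 3, 4),    x = c(20, 99, 40))

# check (m:1): the key must be unique in the using dataset
stopifnot(anyDuplicated(using$v1) == 0)

# gen: flag the origin of each row, then do a full merge
master$in_master <- TRUE
using$in_using   <- TRUE
merged <- merge(master, using, by = "v1", all = TRUE)
merged$merge_code <- ifelse(is.na(merged$in_using), 1L,        # master only
                     ifelse(is.na(merged$in_master), 2L, 3L))  # using only, matched

# update: keep the master value when present, otherwise take the using value
merged$x <- ifelse(is.na(merged$x.x), merged$x.y, merged$x.x)

merged <- merged[, c("v1", "x", "merge_code")]
print(merged)
```

Note that the master value 30 for `v1 == 3` is kept even though the using dataset holds 99: only missing master values are filled.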
    # pctile computes quantiles and weighted quantiles of type 2 (similar to Stata _pctile)
    v <- c(NA, 1:10)
    pctile(v, probs = c(0.3, 0.7), na.rm = TRUE)

    # xtile creates an integer variable for quantile categories (corresponds to Stata xtile)
    v <- c(NA, 1:10)
    xtile(v, n_quantiles = 3)        # 3 groups based on terciles
    xtile(v, probs = c(0.3, 0.7))    # 3 groups based on two quantiles
    xtile(v, cutpoints = c(2, 3))    # 3 groups based on two cutpoints

    # winsorize (default based on 5 x interquartile range)
    v <- c(1:4, 99)
    winsorize(v)
    winsorize(v, replace = NA)
    winsorize(v, probs = c(0.01, 0.99))
    winsorize(v, cutpoints = c(1, 50))
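For readers who want to see what these calls amount to, here is a rough base R counterpart of the unweighted cases. It is a sketch under assumptions, not statar's code: "type 2" is taken to mean the same definition as `type = 2` in stats::quantile, and the cutpoint intervals are assumed to be closed on the right, as in Stata.

```r
# Base R counterparts of the unweighted cases (illustration only).
v <- c(NA, 1:10)

# Type-2 quantiles: inverse empirical CDF, averaging at discontinuities
q <- quantile(v, probs = c(0.25, 0.5), type = 2, na.rm = TRUE)

# Categories from explicit cutpoints: (-Inf, 2], (2, 3], (3, Inf)
groups <- cut(v, breaks = c(-Inf, 2, 3, Inf), labels = FALSE)

# Winsorizing at fixed cutpoints: clamp every value into [1, 50]
w <- pmin(pmax(c(1:4, 99), 1), 50)

print(q)
print(groups)
print(w)
```

Missing values stay missing in `groups`, and the outlier 99 is pulled back to the upper cutpoint 50 in `w`.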
The classes "monthly" and "quarterly" print as dates and are compatible with the usual time extraction functions (i.e. month, year, etc.). Yet they are stored as integers representing the number of periods elapsed since 1970/01/01 (in months and quarters, respectively). This is particularly handy for simple algebra:
    # elapsed dates
    library(lubridate)
    date <- mdy(c("04/03/1992", "01/04/1992", "03/15/1992"))
    datem <- as.monthly(date)
    # displays as a period
    datem
    #> [1] "1992m04" "1992m01" "1992m03"
    # behaves as an integer for numerical operations:
    datem + 1
    #> [1] "1992m05" "1992m02" "1992m04"
    # behaves as a date for period extractions:
    year(datem)
    #> [1] 1992 1992 1992
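The integer representation can be reproduced by hand in base R. The sketch below assumes that a monthly value counts whole months since January 1970; that is an assumption about the encoding, and the variable names (`elapsed`, `label`) are illustrative.

```r
# Months elapsed since January 1970, computed in base R (illustration only).
date <- as.Date(c("1992-04-03", "1992-01-04", "1992-03-15"))
lt <- as.POSIXlt(date)

# lt$year counts years since 1900, lt$mon counts months from 0
elapsed <- (lt$year + 1900 - 1970) * 12 + lt$mon

# Adding 1 moves every date forward by exactly one month
shifted <- elapsed + 1

# Convert the integers back to a "1992m04"-style label
to_label <- function(m) sprintf("%dm%02d", 1970 + m %/% 12, m %% 12 + 1)
label         <- to_label(elapsed)
label_shifted <- to_label(shifted)

print(elapsed)
print(label)
print(label_shifted)
```

Because a period is a plain integer, lags, leads, and gaps between dates reduce to ordinary arithmetic.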
tlag/tlead lag or lead a vector with respect to a number of periods, not with respect to the number of rows
    year <- c(1989, 1991, 1992)
    value <- c(4.1, 4.5, 3.3)
    tlag(value, 1, time = year)

    library(lubridate)
    date <- mdy(c("01/04/1992", "03/15/1992", "04/03/1992"))
    datem <- as.monthly(date)
    value <- c(4.1, 4.5, 3.3)
    tlag(value, time = datem)
In contrast to comparable functions in zoo and xts, these functions can be applied to any vector and be used within a dplyr chain:
    df <- tibble(
      id    = c(1, 1, 1, 2, 2),
      year  = c(1989, 1991, 1992, 1991, 1992),
      value = c(4.1, 4.5, 3.3, 3.2, 5.2)
    )
    df %>% group_by(id) %>% mutate(value_l = tlag(value, time = year))
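The difference between a period-based lag and a row-based lag is easiest to see side by side. The helper `tlag_sketch` below is hypothetical and written in base R; it only illustrates the idea of looking up the value observed `n` periods earlier and is not statar's implementation.

```r
# Period-based lag in base R (illustration only; tlag_sketch is a made-up helper).
tlag_sketch <- function(x, n = 1, time) {
  stopifnot(length(x) == length(time), !anyDuplicated(time))
  # For each observation, find the row whose time equals (time - n);
  # match() returns NA when that period is absent, so the lag is NA too
  x[match(time - n, time)]
}

year  <- c(1989, 1991, 1992)
value <- c(4.1, 4.5, 3.3)

by_period <- tlag_sketch(value, 1, time = year)
by_row    <- c(NA, head(value, -1))

print(by_period)  # 1990 is missing, so 1991 has no lag
print(by_row)     # the row-based lag silently uses 1989 for 1991
```

The row-based version assigns the 1989 value to 1991 even though two years separate them, which is exactly the mistake a period-based lag avoids.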
is.panel checks whether a dataset is a panel, i.e. the time variable is never missing and the combinations (id, time) are unique.
    df <- tibble(
      id1   = c(1, 1, 1, 2, 2),
      id2   = 1:5,
      year  = c(1991, 1993, NA, 1992, 1992),
      value = c(4.1, 4.5, 3.3, 3.2, 5.2)
    )
    df %>% group_by(id1) %>% is.panel(year)
    df1 <- df %>% filter(!is.na(year))
    df1 %>% is.panel(year)
    df1 %>% group_by(id1) %>% is.panel(year)
    df1 %>% group_by(id1, id2) %>% is.panel(year)
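The two conditions in the definition translate directly into base R. The function `is_panel_sketch` below is a hypothetical helper that applies them to the same data as above; it is a sketch of the definition, not statar's implementation, and it says nothing about the messages is.panel may print.

```r
# The two panel conditions, checked in base R (illustration only).
is_panel_sketch <- function(time, ...) {
  keys <- data.frame(..., time = time)      # id columns plus the time column
  !anyNA(time) && !any(duplicated(keys))    # no missing time, unique (id, time)
}

id1  <- c(1, 1, 1, 2, 2)
id2  <- 1:5
year <- c(1991, 1993, NA, 1992, 1992)

with_na <- is_panel_sketch(year, id1)                # FALSE: year has an NA

keep <- !is.na(year)
by_id1  <- is_panel_sketch(year[keep], id1[keep])    # FALSE: (2, 1992) appears twice
by_both <- is_panel_sketch(year[keep], id1[keep], id2[keep])  # TRUE

print(c(with_na = with_na, by_id1 = by_id1, by_both = by_both))
```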
fill_gap transforms an unbalanced panel into a balanced panel. It corresponds to the Stata command tsfill. Missing observations are added as rows with missing values.
    df <- tibble(
      id    = c(1, 1, 1, 2),
      datem = as.monthly(mdy(c("04/03/1992", "01/04/1992", "03/15/1992", "05/11/1992"))),
      value = c(4.1, 4.5, 3.3, 3.2)
    )
    df %>% group_by(id) %>% fill_gap(datem)
    df %>% group_by(id) %>% fill_gap(datem, full = TRUE)
    df %>% group_by(id) %>% fill_gap(datem, roll = "nearest")
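Conceptually, filling gaps means building the complete grid of periods for each id and merging the observed data onto it. The base R sketch below assumes that, by default, the grid runs from each group's first to its last observed period; that assumption, the integer time column `t`, and the names `grid` and `filled` are illustrative and not taken from statar.

```r
# Gap filling in base R: build the full period grid per id, then merge (illustration only).
df <- data.frame(
  id    = c(1, 1, 1, 2),
  t     = c(267, 264, 266, 268),   # integer periods, e.g. months since 1970
  value = c(4.1, 4.5, 3.3, 3.2)
)

# For each id, every period between its first and last observation
grid <- do.call(rbind, lapply(split(df, df$id), function(g) {
  data.frame(id = g$id[1], t = seq(min(g$t), max(g$t)))
}))

# Left join: periods with no observation get NA in value
filled <- merge(grid, df, by = c("id", "t"), all.x = TRUE)
print(filled)
```

Here id 1 gains one row for the missing period 265, while id 2, observed only once, is left unchanged.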
stat_binmean() (a stat for ggplot2) returns the mean of y and x within 20 bins of x. It's a bare-bones version of the Stata command binscatter
    library(ggplot2)

    ggplot(iris, aes(x = Sepal.Width, y = Sepal.Length)) +
      stat_binmean()
    # change number of bins
    ggplot(iris, aes(x = Sepal.Width, y = Sepal.Length, color = Species)) +
      stat_binmean(n = 10)
    # add regression line
    ggplot(iris, aes(x = Sepal.Width, y = Sepal.Length, color = Species)) +
      stat_binmean() +
      stat_smooth(method = "lm", se = FALSE)
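The numbers behind such a plot can be computed without ggplot2. The base R sketch below assumes the bins are quantile-based (roughly equal numbers of observations per bin, as in binscatter); whether stat_binmean() bins exactly this way is an assumption, and the names `bin` and `bin_means` are illustrative.

```r
# Bin means in base R, assuming quantile-based bins (illustration only).
x <- iris$Sepal.Width
y <- iris$Sepal.Length
n <- 20

# Quantile breakpoints; ties in x can produce duplicates, so keep unique ones
breaks <- unique(quantile(x, probs = seq(0, 1, length.out = n + 1)))
bin <- cut(x, breaks = breaks, include.lowest = TRUE, labels = FALSE)

# One point per bin: the mean of x and the mean of y within it
bin_means <- data.frame(
  x = as.numeric(tapply(x, bin, mean)),
  y = as.numeric(tapply(y, bin, mean))
)
print(bin_means)
```

With heavily tied data such as `Sepal.Width`, fewer than `n` distinct bins survive, so the number of plotted points can be smaller than the requested number of bins.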
You can install:
The latest released version from CRAN with
    install.packages("statar")
The current version from GitHub with
    devtools::install_github("matthieugomez/statar")