RetroSearch Browse

Home - News ( United States | United Kingdom | Italy | Germany ) - Football scores

Showing content from https://github.com/Polkas/miceFast below:

GitHub - Polkas/miceFast: R enviroment

miceFast

Author: Maciej Nasinski

Check the miceFast website for more details

miceFast provides fast methods for imputing missing data, leveraging an object-oriented programming paradigm and optimized linear algebra routines.
The package includes convenient helper functions compatible with data.table, dplyr, and other popular R packages.

Major speed improvements occur when:

Using a grouping variable, where the data is automatically sorted by group, significantly reducing computation time.
Performing multiple imputations, by evaluating the underlying quantitative model only once for multiple draws.
Running Predictive Mean Matching (PMM), thanks to presorting and binary search.

For performance details, see performance_validity.R in the extdata folder.

You can install miceFast from CRAN:

install.packages("miceFast")

Or install the development version from GitHub:

# install.packages("devtools")
devtools::install_github("polkas/miceFast")

Below is a short demonstration. See the vignette for advanced usage and best practices.

library(miceFast)

set.seed(1234)
data(air_miss)

# Visualize the NA structure
upset_NA(air_miss, 6)

# Simple and naive fill
imputed_data <- naive_fill_NA(air_miss)

# Compare with other packages:
# Hmisc
library(Hmisc)
data.frame(Map(function(x) Hmisc::impute(x, "random"), air_miss))

# mice
library(mice)
mice::complete(mice::mice(air_miss, printFlag = FALSE))

Multiple imputations are performed in a loop where a continuous variable is imputed using a Bayesian linear model (lm_bayes) that incorporates relevant predictors and weights for robust estimation. Simultaneously, a categorical variable is imputed using linear discriminant analysis (LDA) augmented with a randomly generated ridge penalty.

library(dplyr)

# Define a function that performs the imputation on the dataset
impute_data <- function(data) {
  data %>%
    mutate(
      # Impute the continuous variable using lm_bayes
      Solar_R_imp = fill_NA(
        x = .,
        model = "lm_bayes",
        posit_y = "Solar.R",
        posit_x = c("Wind", "Temp", "Intercept"),
        w = weights  # assuming 'weights' is a column in data
      ),
      # Impute the categorical variable using lda with a random ridge parameter
      Ozone_chac_imp = fill_NA(
        x = .,
        model = "lda",
        posit_y = "Ozone_chac",
        posit_x = c("Wind", "Temp"),
        ridge = runif(1, 0, 50)
      )
    )
}

# Set seed for reproducibility
set.seed(123456)

# Run the imputation process 3 times using replicate()
# This returns a list of imputed datasets.
res <- replicate(n = 3, expr = impute_data(air_miss), simplify = FALSE)

# Check results: Calculate the mean of the imputed Solar.R values in each dataset
means_imputed <- lapply(res, function(x) mean(x$Solar_R_imp, na.rm = TRUE))
print(means_imputed)

# Check results: Tabulate the imputed categorical variable for each dataset
tables_imputed <- lapply(res, function(x) table(x$Ozone_chac_imp))
print(tables_imputed)

Object-Oriented Interface via miceFast objects (Rcpp modules).
Convenient Helpers:
- fill_NA(): Single imputation (lda, lm_pred, lm_bayes, lm_noise).
- fill_NA_N(): Multiple imputations (pmm, lm_bayes, lm_noise).
- VIF(): Variance Inflation Factor calculations.
- naive_fill_NA(): Automatic naive imputations.
- compare_imp(): Compare original vs. imputed values.
- upset_NA(): Visualize NA structure using UpSetR.

Quick Reference Table:

Function Description new(miceFast) Creates an OOP instance with numerous imputation methods (see the vignette). fill_NA() Single imputation: lda, lm_pred, lm_bayes, lm_noise. fill_NA_N() Multiple imputations (N repeats): pmm, lm_bayes, lm_noise. VIF() Computes Variance Inflation Factors. naive_fill_NA() Performs automatic, naive imputations. compare_imp() Compares imputations vs. original data. upset_NA() Visualizes NA structure using an UpSet plot.

Benchmark testing (on R 4.4.3, macOS M3 Pro, optimized BLAS and LAPACK) shows miceFast can significantly reduce computation time, especially in these scenarios:

Linear Discriminant Analysis (LDA): ~5x faster.
Grouping Variable Imputations: ~10x faster (and can exceed 100x in some edge cases).
Multiple Imputations: ~x * (number of multiple imputations) faster, since the model is computed only once.
Variance Inflation Factors (VIF): ~5x faster, because we only compute the inverse of X'X.
Predictive Mean Matching (PMM): ~3x faster, thanks to presorting and binary search.

For performance details, see performance_validity.R in the extdata folder.

RetroSearch is an open source project built by @garambo | Open a GitHub Issue

Search and Browse the WWW like it's 1997 | Search results from DuckDuckGo

HTML: 3.2 | Encoding: UTF-8 | Version: 0.7.4