Package: RcppColMetric 0.1.0
Author: Xiurui Zhu
Modified: 2025-03-08 18:17:18
Compiled: 2025-03-08 18:17:53
The goal of RcppColMetric
is to efficiently compute metrics between various vectors and a common vector. This is common in data science, such as computing performance metrics between each feature and a common response. Rcpp
is used to efficiently iterate over vectors through compiled code. You may extend its utilities by providing custom metrics that fit into the framework.
You can install the released version of RcppColMetric
from CRAN with:
install.packages("RcppColMetric")
Alternatively, you can install the developmental version of RcppColMetric
from github with:
remotes::install_github("zhuxr11/RcppColMetric")
We use cats
from MASS
to illustrate the use of the package.
library(MASS) data(cats) print(head(cats)) #> Sex Bwt Hwt #> 1 F 2.0 7.0 #> 2 F 2.0 7.4 #> 3 F 2.0 9.5 #> 4 F 2.1 7.2 #> 5 F 2.1 7.3 #> 6 F 2.1 7.6
In binary classification modelling, it is a common practice to compute ROC-AUC of each feature (usually columns) against a common target. RcppColMetric
provides a much faster version than its commonly used counterparts, e.g. caTools::colAUC()
.
library(caTools) (col_auc_bench <- microbenchmark::microbenchmark( col_auc_r = caTools::colAUC(cats[, 2L:3L], cats[, 1L]), col_auc_cpp = col_auc(cats[, 2L:3L], cats[, 1L]), times = 100L, check = "identical" )) #> Unit: microseconds #> expr min lq mean median uq max neval #> col_auc_r 514.600 544.2005 623.3241 613.2515 666.1505 985.601 100 #> col_auc_cpp 200.401 222.2010 244.9679 236.0010 266.8010 391.501 100
As can be seen, the median speed of computation from RcppColMetric
is 2.599 times faster.
If there are multiple sets of features and responses, you may use the vectorized version col_auc_vec()
, which uses compiled code to speed up iterations and returns a list.
col_auc_vec(list(cats[, 2L:3L]), list(cats[, 1L])) #> [[1]] #> Bwt Hwt #> F vs. M 0.8338451 0.759048Column-wize mutual information
In classification modelling, it is another common practice to assess mutual information between features and a response if the features are discrete. RcppColMetric
provides a much faster version than its commonly used counterparts, e.g. infotheo::mutinformation()
.
library(infotheo) (col_mut_info_bench <- microbenchmark::microbenchmark( col_mut_info_r = sapply(round(cats[, 2L:3L]), infotheo::mutinformation, cats[, 1L]) %>% {matrix(., nrow = 1L, dimnames = list(NULL, names(.)))}, col_mut_info_cpp = col_mut_info(round(cats[, 2L:3L]), cats[, 1L]), times = 100L, check = "identical" )) #> Unit: microseconds #> expr min lq mean median uq max neval #> col_mut_info_r 1587.200 1685.351 1884.131 1766.001 1964.9510 4803.800 100 #> col_mut_info_cpp 618.401 641.551 691.325 659.651 721.6015 943.101 100
As can be seen, the median speed of computation from RcppColMetric
is 2.677 times faster.
If there are multiple sets of features and responses, you may use the vectorized version col_mut_info_vec()
, which uses compiled code to speed up iterations and returns a list.
col_mut_info_vec(list(round(cats[, 2L:3L])), list(cats[, 1L])) #> [[1]] #> Bwt Hwt #> [1,] 0.1346783 0.1620514Extend the package with custom metric
You may implement your own metric by inheriting from RcppColMetric::Metric
class with template arguments as feature SEXP
(input, numeric here) and response SEXP
(input, factor as integer here) types. For example, to compute range of each feature, define a RangeMetric
class.
#include <RcppColMetric.h> #include <Rcpp.h> using namespace Rcpp; // x: numeric (REALSXP), y: factor -> integer (INTSXP), output: numeric (REALSXP) class RangeMetric: public RcppColMetric::Metric<REALSXP, INTSXP, REALSXP> { public: // Constructor RangeMetric(const RObject& x, const IntegerVector& y, const Nullable<List>& args = R_NilValue) { // This parameter is inherited from `Metric`, determining output dimension (number of rows) // For RangeMetric, the output dimension is 2 (min & max) output_dim = 2; } virtual Nullable<CharacterVector> row_names(const RObject& x, const IntegerVector& y, const Nullable<List>& args = R_NilValue) const override { // Determine the row names // If not used, it may return R_NilValue CharacterVector out = {"min", "max"}; return out; } virtual NumericVector calc_col(const NumericVector& x, const IntegerVector& y, const R_xlen_t& i, const Nullable<List>& args = R_NilValue) const override { // Derive output value for each feature and the common response // For RangeMetric, the output is min & max NumericVector out = {min(x), max(x)}; return out; } };
Then, define the main function calling RcppColMetric::col_metric()
, with corresponding feature SEXP
(input, numeric here), response SEXP
(input, factor as integer here) and output SEXP
(output, numeric here) types.
NumericMatrix col_range(const RObject& x, const IntegerVector& y, const Nullable<List>& args = R_NilValue) { RangeMetric range_metric(x, y, args); NumericMatrix out = RcppColMetric::col_metric<REALSXP, INTSXP, REALSXP>(x, y, range_metric, args); return out; }
Test this function with cats
:
col_range(cats[, 2L:3L], cats[, 1L]) #> Bwt Hwt #> min 2.0 6.3 #> max 3.9 20.5
To define vectorized version of the function, a wrapper function is defined to generate RangeMetric
object (taking only x
, y
and args
), and then passed on to the workhorse RcppColMetric::col_metric_vec()
.
RangeMetric gen_range_metric(const RObject& x, const IntegerVector& y, const Nullable<List>& args = R_NilValue) { RangeMetric out(x, y, args); return out; } // [[Rcpp::export]] List col_range_vec(const List& x, const List& y, const Nullable<List>& args = R_NilValue) { List out = RcppColMetric::col_metric_vec<REALSXP, INTSXP, REALSXP>(x, y, &gen_range_metric, args); return out; }
Test the vectorized function with cats
:
col_range_vec(list(cats[, 2L:3L]), list(cats[, 1L])) #> [[1]] #> Bwt Hwt #> min 2.0 6.3 #> max 3.9 20.5
RetroSearch is an open source project built by @garambo | Open a GitHub Issue
Search and Browse the WWW like it's 1997 | Search results from DuckDuckGo
HTML:
3.2
| Encoding:
UTF-8
| Version:
0.7.4