README

aorsf

Fit, interpret, and make predictions with oblique random forests (RFs).

Why aorsf?

Installation

You can install aorsf from CRAN using
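A minimal sketch of the CRAN install, using the standard `install.packages()` call:

```r
# Install the released version of aorsf from CRAN
install.packages("aorsf")
```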

You can install the development version of aorsf from GitHub with:
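A sketch of the GitHub install, assuming the remotes package is used and that the repository lives at `ropensci/aorsf`:

```r
# install.packages("remotes")  # run once if remotes is not installed
remotes::install_github("ropensci/aorsf")
```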

Get started

aorsf fits several types of oblique RFs with the orsf() function, including classification, regression, and survival RFs.

For classification, we fit an oblique RF to predict penguin species using penguin data from the magnificent palmerpenguins R package:
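A minimal sketch of such a fit, assuming the `penguins_orsf` data that ships with aorsf. The object name `penguin_fit` and the small `n_tree` value are illustrative choices, not recommended settings.

```r
library(aorsf)

# Oblique classification RF: predict species from all other columns.
# n_tree = 5 only keeps the example fast.
penguin_fit <- orsf(data = penguins_orsf,
                    formula = species ~ .,
                    n_tree = 5)

penguin_fit
```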

For regression, we use the same data but predict bill length of penguins:
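A comparable sketch for regression, again assuming `penguins_orsf`. The name `bill_fit` and `n_tree = 5` are illustrative.

```r
library(aorsf)

# Oblique regression RF: the outcome is a numeric column
bill_fit <- orsf(data = penguins_orsf,
                 formula = bill_length_mm ~ .,
                 n_tree = 5)

bill_fit
```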

My personal favorite is the oblique survival RF with accelerated Cox regression because it was the first type of oblique RF that aorsf provided (see JCGS paper). Here, we use it to predict mortality risk following diagnosis of primary biliary cirrhosis:
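A sketch of a survival fit, assuming the `pbc_orsf` data that ships with aorsf (primary biliary cirrhosis data with `time`, `status`, and an `id` column that is excluded from the predictors). The name `pbc_fit` and `n_tree = 5` are illustrative.

```r
library(aorsf)

# Oblique survival RF: the outcome is a time/status pair.
# "- id" drops the identifier column from the predictors.
pbc_fit <- orsf(data = pbc_orsf,
                formula = Surv(time, status) ~ . - id,
                n_tree = 5)

pbc_fit
```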

What does “oblique” mean?

Decision trees are grown by splitting a set of training data into non-overlapping subsets, with the goal of having more similarity within the new subsets than between them. When subsets are created with a single predictor, the decision tree is axis-based because the subset boundaries are perpendicular to the axis of the predictor. When a linear combination (i.e., a weighted sum) of variables is used instead of a single variable, the tree is oblique because the boundaries are neither parallel nor perpendicular to the axis.
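As a minimal base R sketch of the difference, the snippet below applies one axis-based rule and one oblique rule to a toy data set. The data, cut-points, and weights are made up for illustration; they are not produced by aorsf.

```r
# Toy data: two predictors (values are illustrative only)
toy <- data.frame(
 x1 = c(1, 2, 3, 4, 5, 6),
 x2 = c(2, 4, 1, 5, 3, 6)
)

# The outcome depends on whether x2 exceeds x1
toy$case <- toy$x2 > toy$x1

# Axis-based split: a single predictor compared with a cut-point
axis_left <- toy$x1 <= 3.5

# Oblique split: a weighted sum of both predictors compared with a cut-point
weights <- c(x1 = -1, x2 = 1)
lincomb <- as.matrix(toy[, c("x1", "x2")]) %*% weights
oblique_left <- as.vector(lincomb) <= 0

# Cross-tabulate each split against the outcome
table(axis = axis_left, case = toy$case)
table(oblique = oblique_left, case = toy$case)
```

In this toy example, no single cut on `x1` separates the two classes, while the weighted sum `x2 - x1` does.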

Figure: Decision trees for classification with axis-based splitting (left) and oblique splitting (right). Cases are orange squares; controls are purple circles. Both trees partition the predictor space defined by variables X1 and X2, but the oblique splits do a better job of separating the two classes.

So, how does this difference translate to real data, and how does it impact random forests comprising hundreds of axis-based or oblique trees? We will demonstrate this using the penguin data [3]. We will also use this function to make several plots:

plot_decision_surface <- function(predictions, title, grid){
 
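 # This is not a general function for plotting
 # decision surfaces. It just helps to minimize
 # copying and pasting of code. It requires dplyr,
 # tidyr, and ggplot2 (e.g., via library(tidyverse))
 # and expects one probability column per species.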
 
 class_preds <- bind_cols(grid, predictions) %>%
  pivot_longer(cols = c(Adelie,
                        Chinstrap,
                        Gentoo)) %>%
  group_by(flipper_length_mm, bill_length_mm) %>%
  arrange(desc(value)) %>%
  slice(1)
 
 cols <- c("darkorange", "purple", "cyan4")

 ggplot(class_preds, aes(bill_length_mm, flipper_length_mm)) +
  geom_contour_filled(aes(z = value, fill = name),
                      alpha = .25) +
  geom_point(data = penguins_orsf,
             aes(color = species, shape = species),
             alpha = 0.5) +
  scale_color_manual(values = cols) +
  scale_fill_manual(values = cols) +
  labs(x = "Bill length, mm",
       y = "Flipper length, mm") +
  theme_minimal() +
  scale_x_continuous(expand = c(0,0)) +
  scale_y_continuous(expand = c(0,0)) +
  theme(panel.grid = element_blank(),
        panel.border = element_rect(fill = NA),
        legend.position = 'none') +
  labs(title = title)
 
}

We also use a grid of points for plotting decision surfaces:
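A sketch of one way to build such a grid, assuming `tidyr::expand_grid()` and a resolution of 200 points per axis (the resolution is an arbitrary choice). The column names match those used by `plot_decision_surface()`.

```r
library(aorsf)  # provides penguins_orsf
library(tidyr)  # provides expand_grid()

# Every combination of evenly spaced flipper and bill lengths,
# spanning the observed range of each predictor
grid <- expand_grid(
 flipper_length_mm = seq(min(penguins_orsf$flipper_length_mm),
                         max(penguins_orsf$flipper_length_mm),
                         length.out = 200),
 bill_length_mm = seq(min(penguins_orsf$bill_length_mm),
                      max(penguins_orsf$bill_length_mm),
                      length.out = 200)
)
```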

We use orsf with mtry=1 to fit axis-based trees:
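A sketch of the single axis-based tree, assuming the two predictors used in the plots. The seed value is arbitrary and only makes the tree reproducible.

```r
library(aorsf)

# mtry = 1 means each split uses one predictor, i.e., an axis-based split
fit_axis_tree <- orsf(data = penguins_orsf,
                      formula = species ~ bill_length_mm + flipper_length_mm,
                      n_tree = 1,
                      mtry = 1,
                      tree_seeds = 106760)
```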

Next we use orsf_update to copy and modify the original model, expanding it to fit an oblique tree by using mtry=2 instead of mtry=1, and to include 500 trees instead of 1:
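A sketch of the three updated fits. The axis-based tree is refit at the top only so that the block runs on its own; the seed value is arbitrary.

```r
library(aorsf)

# Single axis-based tree (refit here so the block is self-contained)
fit_axis_tree <- orsf(data = penguins_orsf,
                      formula = species ~ bill_length_mm + flipper_length_mm,
                      n_tree = 1,
                      mtry = 1,
                      tree_seeds = 106760)

# Same settings, but 500 trees instead of 1
fit_axis_forest <- orsf_update(fit_axis_tree, n_tree = 500)

# mtry = 2 lets each split combine both predictors (oblique split)
fit_oblique_tree <- orsf_update(fit_axis_tree, mtry = 2)

# Oblique splits and 500 trees
fit_oblique_forest <- orsf_update(fit_oblique_tree, n_tree = 500)
```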

And now we have all we need to visualize decision surfaces using predictions from these four fits:

preds <- list(fit_axis_tree,
              fit_axis_forest,
              fit_oblique_tree,
              fit_oblique_forest) %>% 
 map(predict, new_data = grid, pred_type = 'prob')

titles <- c("Axis-based tree",
            "Axis-based forest",
            "Oblique tree",
            "Oblique forest")

plots <- map2(preds, titles, plot_decision_surface, grid = grid)

Figure: Axis-based and oblique decision surfaces from a single tree and an ensemble of 500 trees. Axis-based trees have boundaries perpendicular to predictor axes, whereas oblique trees can have boundaries that are neither parallel nor perpendicular to predictor axes. Axis-based forests tend to have ‘step-function’ decision boundaries, while oblique forests tend to have smooth decision boundaries.

Variable importance

The importance of individual predictor variables can be estimated in three ways using aorsf, and each approach can be used with any type of oblique RF. Variable importance functions always return a named numeric vector.
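A sketch of the three variable importance functions exported by aorsf (`orsf_vi_negate()`, `orsf_vi_permute()`, and `orsf_vi_anova()`), applied to a small survival forest. The fit and its `n_tree = 5` setting are illustrative, so the resulting values should not be interpreted.

```r
library(aorsf)

pbc_fit <- orsf(data = pbc_orsf,
                formula = Surv(time, status) ~ . - id,
                n_tree = 5)

# Each call returns importance values named by predictor
vi_negate  <- orsf_vi_negate(pbc_fit)
vi_permute <- orsf_vi_permute(pbc_fit)
vi_anova   <- orsf_vi_anova(pbc_fit)

head(vi_negate)
```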

You can supply your own R function to estimate out-of-bag error (see oob vignette) or to estimate out-of-bag variable importance (see orsf_vi examples).

Partial dependence (PD)

Partial dependence (PD) shows the expected prediction from a model as a function of a single predictor or multiple predictors. The expectation is marginalized over the values of all other predictors, giving something like a multivariable adjusted estimate of the model’s prediction. You can supply specific values for a predictor to compute PD, or let aorsf pick reasonable values for you with pred_spec_auto():
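A sketch of both approaches using `orsf_pd_oob()`, which computes PD from out-of-bag predictions. The regression fit, the chosen predictors, and the small `n_tree` value are illustrative assumptions.

```r
library(aorsf)

bill_fit <- orsf(data = penguins_orsf,
                 formula = bill_length_mm ~ .,
                 n_tree = 5)

# Pick your own values for a predictor
pd_manual <- orsf_pd_oob(bill_fit,
                         pred_spec = list(species = c("Adelie", "Gentoo")))

# Let aorsf pick reasonable values
pd_auto <- orsf_pd_oob(bill_fit,
                       pred_spec = pred_spec_auto(bill_depth_mm, island))

pd_manual
```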

The summary function, orsf_summarize_uni(), computes PD for as many variables as you ask it to, using sensible values.
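A sketch of a univariate summary, assuming a small survival forest fit to `pbc_orsf`. The `n_variables = 2` value is an illustrative choice that limits the summary to two predictors.

```r
library(aorsf)

pbc_fit <- orsf(data = pbc_orsf,
                formula = Surv(time, status) ~ . - id,
                n_tree = 5)

# Summarize PD for two predictors, using values chosen by aorsf
pbc_summary <- orsf_summarize_uni(pbc_fit, n_variables = 2)

pbc_summary
```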

For more on PD, see the vignette.

Individual conditional expectations (ICE)

Unlike partial dependence, which shows the expected prediction as a function of one or multiple predictors, individual conditional expectations (ICE) show the prediction for an individual observation as a function of a predictor.
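A sketch of ICE using `orsf_ice_oob()`, the out-of-bag counterpart of the PD function. The predictor `bili` and its three values are illustrative choices.

```r
library(aorsf)

pbc_fit <- orsf(data = pbc_orsf,
                formula = Surv(time, status) ~ . - id,
                n_tree = 5)

# One prediction per observation for each requested value of bili
ice <- orsf_ice_oob(pbc_fit, pred_spec = list(bili = c(1, 2, 3)))

head(ice)
```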

For more on ICE, see the vignette.

Interaction scores

The orsf_vint() function computes a score for each possible interaction in a model based on PD using the method described in Greenwell et al., 2018 [5]. It can be slow for larger datasets, but substantial speedups occur by making use of multi-threading and restricting the search to a smaller set of predictors.
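A sketch of a restricted, multi-threaded search, assuming an aorsf version that provides `orsf_vint()`. The five predictors, the number of trees, and the seed are illustrative choices; `n_thread = 0` asks aorsf to choose the number of threads.

```r
library(aorsf)

# Scores from forests with very few trees may be unstable,
# so this sketch uses more trees than the earlier examples.
pbc_fit <- orsf(data = pbc_orsf,
                formula = Surv(time, status) ~ . - id,
                n_tree = 100,
                tree_seeds = 329)

# Restrict the search to a handful of predictors to save time
preds_interaction <- c("albumin", "protime", "bili", "spiders", "trt")

pbc_interactions <- orsf_vint(pbc_fit,
                              predictors = preds_interaction,
                              n_thread = 0)

pbc_interactions
```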

What do the values in score mean? These values are the average of the standard deviation of the standard deviation of PD in one variable conditional on the other variable. They should be interpreted relative to one another, i.e., a higher scoring interaction is more likely to reflect a real interaction between two variables than a lower scoring one.

Do these interaction scores make sense? Let’s test the top scoring and lowest scoring interactions using coxph().
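A sketch of such a check with the survival package. Which pairs score highest and lowest depends on the fitted forest, so the two pairs below are placeholders; substitute the pairs from your own orsf_vint() output.

```r
library(aorsf)     # provides pbc_orsf
library(survival)  # provides coxph() and Surv()

# Placeholder pair standing in for the highest scoring interaction
fit_high <- coxph(Surv(time, status) ~ albumin * protime, data = pbc_orsf)

# Placeholder pair standing in for the lowest scoring interaction
fit_low <- coxph(Surv(time, status) ~ spiders * trt, data = pbc_orsf)

# Sequential likelihood ratio tests; the interaction term is the last row
anova(fit_high)
anova(fit_low)
```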

Note: this is exploratory and not a true null hypothesis test. Why? Because we used the same data both to generate and to test the null hypothesis. We are not so much conducting statistical inference when we test these interactions with coxph as we are demonstrating the interaction scores that orsf_vint() provides are consistent with tests from other models.

Comparison to existing software

For survival analysis, comparisons between aorsf and existing software are presented in our JCGS paper.

References
  1. Jaeger BC, Long DL, Long DM, Sims M, Szychowski JM, Min Y, McClure LA, Howard G, Simon N (2019). “Oblique random survival forests.” The Annals of Applied Statistics, 13(3). doi:10.1214/19-aoas1261 https://doi.org/10.1214/19-aoas1261.
  2. Jaeger BC, Welden S, Lenoir K, Speiser JL, Segar MW, Pandey A, Pajewski NM (2023). “Accelerated and interpretable oblique random survival forests.” Journal of Computational and Graphical Statistics, 1-16. doi:10.1080/10618600.2023.2231048 https://doi.org/10.1080/10618600.2023.2231048.
  3. Horst AM, Hill AP, Gorman KB (2020). palmerpenguins: Palmer Archipelago (Antarctica) penguin data. R package version 0.1.0, https://allisonhorst.github.io/palmerpenguins/.
  4. Menze BH, Kelm BM, Splitthoff DN, Koethe U, Hamprecht FA (2011). “On oblique random forests.” In Machine Learning and Knowledge Discovery in Databases: European Conference, ECML PKDD 2011, Athens, Greece, September 5-9, 2011, Proceedings, Part II 22, 453-469. Springer.
  5. Greenwell BM, Boehmke BC, McCarthy AJ (2018). “A simple and effective model-based variable importance measure.” arXiv preprint arXiv:1805.04755.

Funding

The developers of aorsf received financial support from the Center for Biomedical Informatics, Wake Forest University School of Medicine. We also received support from the National Center for Advancing Translational Sciences of the National Institutes of Health under Award Number UL1TR001420.

The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.

