Type I error rates and Power rate estimates

Determine the Type I error and power rates for a t-test when different group sizes are used. The independent t-test has three main assumptions:

1. the observations are independent,
2. the populations from which the groups are sampled are normally distributed, and
3. the population variances of the two groups are equal (homogeneity of variance).

The Welch test, on the other hand, only requires the first two assumptions. In the simulation below, we will explore how robust the independent t-test is to violations of the third assumption, and compare how well the Welch test behaves under the same conditions.

Post-hoc Power Analysis

As an aside, while the following simulation explores empirical Type I error and power rates, the same logic can be used to evaluate so-called post-hoc power analyses in empirical research. In this specific application only the power results would be obtained, given some fixed sample size (the sample size used in the original research), population effect size(s), assumed random variable distributions, and \(\alpha\) cut-off, to obtain the empirical estimate of the power, \(1-\beta\). This is primarily useful in evaluating whether a published statistical test had a fair chance of rejecting an incorrect null hypothesis.

Alternatively, substituting in effect size estimates from a research study will result in (highly questionable) estimates of observed power. In general, observed power is of little use in applied research, as it conflates population generating parameters with sub-optimal empirical estimates of those parameters, which is particularly problematic in smaller sample sizes where the difference between a parameter and its associated estimate can be large.
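
To make the former (assumed population values) case concrete, here is a minimal sketch of such a post-hoc power computation using only base R; the per-group sample size n = 25 and effect size d_pop = 0.5 are hypothetical choices, not values used in the simulation that follows:

# empirical power of an independent t-test under assumed population values
d_pop <- 0.5    # assumed population standardized mean difference (hypothetical)
n     <- 25     # fixed per-group sample size from the original study (hypothetical)
pvals <- replicate(10000, {
    g1 <- rnorm(n)
    g2 <- rnorm(n, mean = d_pop)
    t.test(g1, g2, var.equal = TRUE)$p.value
})
mean(pvals < .05)    # empirical estimate of the power, 1 - beta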

Step 1: Define conditions

First, define the condition combinations that should be investigated. In this case we know that the total sample size, the ratio of the sample sizes between the groups, and the ratio of the group standard deviations will influence the detection rates of the independent t-test. Here we define four total sample sizes, three group size ratios, and three standard deviation ratios, and investigate their effects when completely crossed (along with two mean differences, which distinguish the Type I error conditions from the power conditions).

library(SimDesign)
# SimFunctions()

Design <- createDesign(sample_size = c(30, 60, 90, 120),
                       group_size_ratio = c(1/2, 1, 2),
                       standard_deviation_ratio = c(1, 4, 8),
                       mean_difference = c(0, 0.5))
dim(Design)
[1] 72  4
head(Design, 3)
# A tibble: 3 × 4
  sample_size group_size_ratio standard_deviation_ratio mean_difference
        <dbl>            <dbl>                    <dbl>           <dbl>
1          30              0.5                        1               0
2          60              0.5                        1               0
3          90              0.5                        1               0
tail(Design, 3)
# A tibble: 3 × 4
  sample_size group_size_ratio standard_deviation_ratio mean_difference
        <dbl>            <dbl>                    <dbl>           <dbl>
1          60                2                        8             0.5
2          90                2                        8             0.5
3         120                2                        8             0.5

Each row in Design represents a unique condition to be studied in the simulation. In this case, the first condition to be studied comes from row 1, where the total sample size is \(N=30\) and the group sizes differ: the second group is half the size (0.5) of the first, therefore \(N1 = 20\) and \(N2 = 10\). The standard deviation ratio in this case is 1 (i.e., the group standard deviations are exactly equal), and there is no mean difference between the groups. Analyzing rows with no mean difference will give us an idea of how the Type I error rates behave.

The last condition, on row 72, indicates that the second group has a sample size twice as large as the first, here \(N1 = 40\) and \(N2 = 80\). As well, group 2 has a standard deviation 8 times larger than the first group, and overall there is a mean difference of 0.5 (i.e., a power condition). Strictly speaking, however, the interpretations could be switched around, as they depend entirely on how the Generate() function uses these inputs (the interpretation could have been that group 1 was the smaller group when group_size_ratio == .5 rather than the larger one).
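
To make the arithmetic concrete, the group sizes implied by these two rows can be computed directly; this mirrors the logic used in the Generate() definition in the next step:

# row 1: total N = 30, second group half the size of the first
sample_size <- 30; group_size_ratio <- 1/2
N2 <- sample_size / (1/group_size_ratio + 1)    # 30 / 3 = 10
N1 <- sample_size - N2                          # 20

# row 72: total N = 120, second group twice the size of the first
sample_size <- 120; group_size_ratio <- 2
N1 <- sample_size / (group_size_ratio + 1)      # 120 / 3 = 40
N2 <- sample_size - N1                          # 80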

Step 2: Define the functions

Begin by defining a data generation function for returning a data.frame with two columns: the dependent variable, and a grouping variable (a factor or otherwise).

Generate <- function(condition, fixed_objects = NULL){
    Attach(condition) # makes all elements in condition available
    # solve the two group sizes from the total sample size and the ratio
    if(group_size_ratio < 1){
        N2 <- sample_size / (1/group_size_ratio + 1)
        N1 <- sample_size - N2
    } else {
        N1 <- sample_size / (group_size_ratio + 1)
        N2 <- sample_size - N1
    }
    # group 1 is standard normal; group 2 receives the mean and SD manipulations
    group1 <- rnorm(N1)
    group2 <- rnorm(N2, mean=mean_difference, sd=standard_deviation_ratio)
    dat <- data.frame(group = c(rep('g1', N1), rep('g2', N2)), DV = c(group1, group2))
    dat
}

The Generate() function defined above first extracts the elements from condition for convenience, generates data for two groups using random normal draws, and returns a data.frame object containing the dependent variable and grouping information.
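
As an optional sanity check (not part of the simulation itself), the generator can be run interactively on a single design row; the group counts match the row-1 interpretation described earlier:

dat <- Generate(Design[1, ])   # condition taken from row 1 of Design
table(dat$group)
# g1 g2 
# 20 10 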

NOTE: if you are using RStudio with diagnostic flags enabled, then the use of Attach() will result in false-positive 'missing' variable warnings because the IDE does not track the appropriate variable names (hence, you'll see yellow caution flags). To fix these false positives without disabling such potentially useful diagnostics, simply copy and paste the following output somewhere into your R script; this suppresses the messages only for the specific variables rather than for all variables in the analysis:

Attach(Design, RStudio_flags = TRUE)
# !diagnostics suppress=sample_size,group_size_ratio,standard_deviation_ratio,mean_difference

Continuing with the example, the Analyse() definition below performs a Welch and an independent \(t\)-test on the data generated from Generate(), and extracts and returns the respective \(p\)-values for each statistical analysis (any errors thrown during estimation are caught and managed automatically by runSimulation()).

Analyse <- function(condition, dat, fixed_objects = NULL){
    welch <- t.test(DV ~ group, dat)                 # Welch test (R's default)
    ind <- t.test(DV ~ group, dat, var.equal=TRUE)   # classical independent t-test
    ret <- c(welch=welch$p.value, independent=ind$p.value)  # named p-values
    ret
}
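
The same kind of interactive check works for Analyse(); passing a generated dataset through it returns the named vector of \(p\)-values that will be collected across replications (the values themselves vary with the random data):

dat <- Generate(Design[1, ])
Analyse(Design[1, ], dat)   # returns c(welch = ..., independent = ...)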

Finally, after the number of replications defined in replications has been completed, a matrix (or list, if the output of Analyse() were a list instead of a vector) containing the overall results should be summarised into suitable meta-descriptions of the behaviour. This is accomplished with the Summarise() function.

Summarise <- function(condition, results, fixed_objects = NULL){
    ret <- EDR(results, alpha = .05)
    ret
}

In the above function, the results input is a matrix of size replications by 2 containing the respective \(p\)-values returned from Analyse() across the desired number of replications. It is here where empirical detection rates can be determined, and the convenience function EDR() is available for just this purpose. Each column in the supplied matrix of \(p\)-values is checked to determine whether it is less than \(\alpha = .05\), and the proportion of times this occurs is returned as a vector of length 2 with the associated column names. Note that names are preserved by EDR(), therefore it is clear which element represents the Welch rates and which the independent t-test rates.
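
To see EDR() in isolation, consider a toy matrix of \(p\)-values (hypothetical numbers chosen only to illustrate the calculation):

p <- cbind(welch       = c(.01, .20, .03, .70),
           independent = c(.04, .60, .02, .30))
EDR(p, alpha = .05)   # welch = 0.5, independent = 0.5 (2 of 4 below alpha)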

Step 3: Run the simulation

In this case a few extra arguments are passed to runSimulation(). To help speed things up, parallel processing is enabled by passing parallel = TRUE, and messages printed to the console are disabled by passing verbose = FALSE.

res <- runSimulation(design = Design, replications = 1000, 
                     generate = Generate, analyse = Analyse, summarise = Summarise,
                     verbose = FALSE, parallel = TRUE)

Step 4: Analyse the results

Usually it is a good idea to split the results into Type I error and power rate data, and to convert design conditions into factors if/where necessary. More specifically, power rates should really only be interpreted when the Type I error rates are well behaved (liberal detection rates naturally result in more powerful tests, while conservative detection rates induce lower power).

res
# A tibble: 72 × 10
   sample_size group_size_ratio standard_deviation_ratio mean_difference welch
         <dbl>            <dbl>                    <dbl>           <dbl> <dbl>
 1          30              0.5                        1               0 0.042
 2          60              0.5                        1               0 0.051
 3          90              0.5                        1               0 0.05 
 4         120              0.5                        1               0 0.051
 5          30              1                          1               0 0.053
 6          60              1                          1               0 0.05 
 7          90              1                          1               0 0.051
 8         120              1                          1               0 0.052
 9          30              2                          1               0 0.053
10          60              2                          1               0 0.051
# ℹ 62 more rows
# ℹ 5 more variables: independent <dbl>, REPLICATIONS <dbl>, SIM_TIME <chr>,
#   SEED <int>, COMPLETED <chr>
(TypeI <- subset(res, mean_difference == 0, select=sample_size:independent))
# A tibble: 36 × 6
   sample_size group_size_ratio standard_deviation_ratio mean_difference welch
         <dbl>            <dbl>                    <dbl>           <dbl> <dbl>
 1          30              0.5                        1               0 0.042
 2          60              0.5                        1               0 0.051
 3          90              0.5                        1               0 0.05 
 4         120              0.5                        1               0 0.051
 5          30              1                          1               0 0.053
 6          60              1                          1               0 0.05 
 7          90              1                          1               0 0.051
 8         120              1                          1               0 0.052
 9          30              2                          1               0 0.053
10          60              2                          1               0 0.051
# ℹ 26 more rows
# ℹ 1 more variable: independent <dbl>
(Power <- subset(res, mean_difference != 0, select=sample_size:independent))
# A tibble: 36 × 6
   sample_size group_size_ratio standard_deviation_ratio mean_difference welch
         <dbl>            <dbl>                    <dbl>           <dbl> <dbl>
 1          30              0.5                        1             0.5 0.21 
 2          60              0.5                        1             0.5 0.411
 3          90              0.5                        1             0.5 0.577
 4         120              0.5                        1             0.5 0.726
 5          30              1                          1             0.5 0.274
 6          60              1                          1             0.5 0.47 
 7          90              1                          1             0.5 0.649
 8         120              1                          1             0.5 0.797
 9          30              2                          1             0.5 0.239
10          60              2                          1             0.5 0.443
# ℹ 26 more rows
# ℹ 1 more variable: independent <dbl>

Let's start off by looking at some descriptive tables, and performing simple ANOVAs with effect sizes to determine where the differences generally occurred.

library(reshape2)
mlt <- melt(TypeI, id.vars = colnames(TypeI)[1:4])
out <- dcast(mlt, group_size_ratio + standard_deviation_ratio + sample_size ~ variable)
colnames(out) <- c('Group Ratio', 'SD Ratio', 'N', 'Welch', 'Independent')
emph <- suppressWarnings(which(out > .075 | out < .025, arr.ind = TRUE))
pander::pander(out, round=2, emphasize.strong.cells = emph)
Group Ratio   SD Ratio     N   Welch   Independent
-----------   --------   ---   -----   -----------
        0.5          1    30    0.04          0.04
        0.5          1    60    0.05          0.05
        0.5          1    90    0.05          0.05
        0.5          1   120    0.05          0.05
        0.5          4    30    0.06          0.16
        0.5          4    60    0.07          0.18
        0.5          4    90    0.05          0.16
        0.5          4   120    0.06          0.16
        0.5          8    30    0.06          0.17
        0.5          8    60    0.05          0.14
        0.5          8    90    0.06          0.18
        0.5          8   120    0.05          0.17
          1          1    30    0.05          0.05
          1          1    60    0.05          0.05
          1          1    90    0.05          0.05
          1          1   120    0.05          0.05
          1          4    30    0.05          0.05
          1          4    60    0.05          0.05
          1          4    90    0.05          0.05
          1          4   120    0.04          0.04
          1          8    30    0.05          0.05
          1          8    60    0.05          0.06
          1          8    90    0.05          0.05
          1          8   120    0.04          0.04
          2          1    30    0.05          0.05
          2          1    60    0.05          0.04
          2          1    90    0.05          0.05
          2          1   120    0.04          0.04
          2          4    30    0.05          0.01
          2          4    60    0.05          0.01
          2          4    90    0.06          0.01
          2          4   120    0.05          0.01
          2          8    30    0.05          0.01
          2          8    60    0.05          0.00
          2          8    90    0.04          0.01
          2          8   120    0.05          0.01
out2 <- dcast(mlt, group_size_ratio + standard_deviation_ratio ~ variable, fun.aggregate = mean)
colnames(out2) <- c('Group Ratio', 'SD Ratio', 'Welch', 'Independent')
emph2 <- suppressWarnings(which(out2 > .075 | out2 < .025, arr.ind = TRUE))
pander::pander(out2, round=2, emphasize.strong.cells = emph2)
Group Ratio   SD Ratio   Welch   Independent
-----------   --------   -----   -----------
        0.5          1    0.05          0.05
        0.5          4    0.06          0.17
        0.5          8    0.05          0.16
          1          1    0.05          0.05
          1          4    0.04          0.05
          1          8    0.05          0.05
          2          1    0.05          0.04
          2          4    0.05          0.01
          2          8    0.05          0.01

The descriptive results seem very favorable for the Welch test in that its rates are all close to the nominal detection rate. The independent \(t\)-test, however, has combinations of conditions where it is either liberal or conservative. Though we could simply inspect the table to see why/where this occurred, it is often easier to use ANOVAs to help pinpoint the sources of variation.

The one catch with ANOVAs is that there must be within-group variability, therefore something in our design must be marginalized. In this case we marginalize over sample size, because Type I error rates really shouldn't be influenced by \(N\). Inspecting the previous table also suggests that this assumption is a reasonable one.
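
As a quick descriptive check of this assumption, the rates can be averaged within each sample size using base R's aggregate(); near-identical means across \(N\) support marginalizing it (an informal complement to the formal test below):

# average detection rates per total sample size, collapsing the other factors
aggregate(cbind(welch, independent) ~ sample_size, data = TypeI, FUN = mean)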

SimAnova(independent ~ sample_size, TypeI)
                SS df    MS     F     p sig eta.sq eta.sq.part
sample_size  0.097  1 0.097 0.076 0.784   .  0.002       0.002
Residuals   43.265 34 1.272    NA    NA      0.998          NA
SimAnova(independent ~ group_size_ratio * standard_deviation_ratio, TypeI)
                                              SS df     MS      F     p sig
group_size_ratio                          28.673  1 28.673 254.12 0.000 ***
standard_deviation_ratio                   0.331  1  0.331   2.94 0.096   .
group_size_ratio:standard_deviation_ratio 10.746  1 10.746  95.24 0.000 ***
Residuals                                  3.611 32  0.113     NA    NA    
                                          eta.sq eta.sq.part
group_size_ratio                           0.661       0.888
standard_deviation_ratio                   0.008       0.084
group_size_ratio:standard_deviation_ratio  0.248       0.749
Residuals                                  0.083          NA

After examining the effect sizes for the independent \(t\)-test, there appears to be an interaction between group_size_ratio and standard_deviation_ratio. Before visualizing this effect, running the same ANOVA on the Welch test results confirms that no comparable effects are present:

SimAnova(welch ~ group_size_ratio * standard_deviation_ratio, TypeI)
                                             SS df    MS     F     p sig eta.sq
group_size_ratio                          0.034  1 0.034 2.328 0.137   .  0.068
standard_deviation_ratio                  0.000  1 0.000 0.006 0.940   .  0.000
group_size_ratio:standard_deviation_ratio 0.001  1 0.001 0.094 0.761   .  0.003
Residuals                                 0.467 32 0.015    NA    NA      0.929
                                          eta.sq.part
group_size_ratio                                0.068
standard_deviation_ratio                        0.000
group_size_ratio:standard_deviation_ratio       0.003
Residuals                                          NA

To help visualize this interaction, we create box-plots while continuing to marginalize over sample size (otherwise, we would have to use point estimates in scatter plots).

library(ggplot2)

# Welch
gg <- ggplot(TypeI, aes(factor(group_size_ratio), welch, fill = factor(standard_deviation_ratio)))
gg + geom_boxplot() + facet_wrap(~standard_deviation_ratio) + ylim(c(0,0.2)) + 
    geom_hline(yintercept = .075, colour = 'red') + geom_hline(yintercept = .025, colour = 'red')

# Independent
gg <- ggplot(TypeI, aes(factor(group_size_ratio), independent, fill = factor(standard_deviation_ratio)))
gg + geom_boxplot() + facet_wrap(~standard_deviation_ratio) + ylim(c(0,0.2)) + 
    geom_hline(yintercept = .075, colour = 'red') + geom_hline(yintercept = .025, colour = 'red')

What is immediately obvious from these figures is that the Welch test controls the Type I error rates fairly well across all conditions, while the independent \(t\)-test is heavily influenced by particular combinations of group-size and standard-deviation ratios. When the variances were equal (left facet plot), the independent \(t\)-test performed well, which should not be surprising because all of its assumptions were met. Likewise, when the sample sizes were equal the test performed well across the board.

However, when the second group contained a larger variance AND a larger sample size than the first group, the detection rates became more conservative. The converse was true when larger variances were paired with smaller sample sizes, where the detection rates became progressively more liberal as the ratio of sample sizes and the ratio of standard deviations increased.

An alternative approach

The following is the same simulation design as above, approached slightly differently. Because the design input is a data.frame object, it is also possible to pass lists as factors to createDesign(), thereby creating a fully-crossed simulation design from the list inputs. This may be preferable if, for example, you would rather explicitly specify the group distributions in the design rather than encode them with scalars like 1/2, 1, and 2.

Consider the following design definition:

library(SimDesign)
# SimFunctions()

Design <- createDesign(sample_size = c(30, 60, 90, 120),
                       group_sizes = list(c(1/3, 2/3),
                                          c(1/2, 1/2),
                                          c(2/3, 1/3)),
                       standard_deviations = list(c(1, 1),
                                                  c(1, 4),
                                                  c(1, 8)),
                       means = list(c(0, 0), c(0, 0.5)))
Design
# A tibble: 72 × 4
   sample_size group_sizes standard_deviations means    
         <dbl> <list>      <list>              <list>   
 1          30 <dbl [2]>   <dbl [2]>           <dbl [2]>
 2          60 <dbl [2]>   <dbl [2]>           <dbl [2]>
 3          90 <dbl [2]>   <dbl [2]>           <dbl [2]>
 4         120 <dbl [2]>   <dbl [2]>           <dbl [2]>
 5          30 <dbl [2]>   <dbl [2]>           <dbl [2]>
 6          60 <dbl [2]>   <dbl [2]>           <dbl [2]>
 7          90 <dbl [2]>   <dbl [2]>           <dbl [2]>
 8         120 <dbl [2]>   <dbl [2]>           <dbl [2]>
 9          30 <dbl [2]>   <dbl [2]>           <dbl [2]>
10          60 <dbl [2]>   <dbl [2]>           <dbl [2]>
# ℹ 62 more rows

Notice that there are still 72 conditions to be investigated, however now each element in Design$group_sizes, Design$standard_deviations, and Design$means is a vector rather than a scalar. E.g.,

Design[1, ]
# A tibble: 1 × 4
  sample_size group_sizes standard_deviations means    
        <dbl> <list>      <list>              <list>   
1          30 <dbl [2]>   <dbl [2]>           <dbl [2]>

Hence, all elements are explicitly defined for each group rather than indexed and manipulated as in the previous example. The remainder of the simulation code is as follows (note that in this case only the Generate() function needed to be modified):

Generate <- function(condition, fixed_objects = NULL){
    N <- with(condition, sample_size)
    # list-column elements hold one vector per condition; extract with [[1]]
    Ns <- N * with(condition, group_sizes[[1]])
    N1 <- Ns[1]; N2 <- Ns[2]
    sd <- with(condition, standard_deviations[[1]])
    m <- with(condition, means[[1]])
    group1 <- rnorm(N1, mean=m[1], sd=sd[1])
    group2 <- rnorm(N2, mean=m[2], sd=sd[2])
    dat <- data.frame(group = c(rep('g1', N1), rep('g2', N2)), DV = c(group1, group2))
    dat
}
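
As before, an optional interactive check confirms the intended group sizes; the first design row now implies \(N1 = 10\) and \(N2 = 20\) via the explicit proportions c(1/3, 2/3):

dat <- Generate(Design[1, ])
table(dat$group)
# g1 g2 
# 10 20 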

Analyse <- function(condition, dat, fixed_objects = NULL){
    welch <- t.test(DV ~ group, dat)
    ind <- t.test(DV ~ group, dat, var.equal=TRUE)
    ret <- c(welch=welch$p.value, independent=ind$p.value)
    ret
}

Summarise <- function(condition, results, fixed_objects = NULL){
    ret <- EDR(results, alpha = .05)
    ret
}

#---------------------------------------------------

res <- runSimulation(design = Design, replications = 1000, verbose = FALSE, parallel=TRUE,
                     generate = Generate, analyse = Analyse, summarise = Summarise)
res
# A tibble: 72 × 10
   sample_size group_sizes standard_deviations means     welch independent
         <dbl> <list>      <list>              <list>    <dbl>       <dbl>
 1          30 <dbl [2]>   <dbl [2]>           <dbl [2]> 0.048       0.047
 2          60 <dbl [2]>   <dbl [2]>           <dbl [2]> 0.052       0.051
 3          90 <dbl [2]>   <dbl [2]>           <dbl [2]> 0.041       0.044
 4         120 <dbl [2]>   <dbl [2]>           <dbl [2]> 0.047       0.051
 5          30 <dbl [2]>   <dbl [2]>           <dbl [2]> 0.046       0.046
 6          60 <dbl [2]>   <dbl [2]>           <dbl [2]> 0.041       0.041
 7          90 <dbl [2]>   <dbl [2]>           <dbl [2]> 0.049       0.049
 8         120 <dbl [2]>   <dbl [2]>           <dbl [2]> 0.067       0.067
 9          30 <dbl [2]>   <dbl [2]>           <dbl [2]> 0.052       0.061
10          60 <dbl [2]>   <dbl [2]>           <dbl [2]> 0.056       0.052
# ℹ 62 more rows
# ℹ 4 more variables: REPLICATIONS <dbl>, SIM_TIME <chr>, SEED <int>,
#   COMPLETED <chr>
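
One practical difference with this approach is that subsetting the results by condition now involves list columns. For example, the null (Type I error) rows can be flagged by inspecting the means elements directly (TypeI2 below is just an illustrative name):

# flag rows whose two population means are identical (i.e., null conditions)
is_null <- sapply(res$means, function(m) all(m == m[1]))
TypeI2 <- res[is_null, ]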
