Determine the Type I error and power rates for a t-test when different group sizes are used. The independent t-test has three main assumptions:

1) the observations are independent, within and across groups,
2) the populations are normally distributed, and
3) the population variances (standard deviations) are equal across groups.
The Welch test, on the other hand, only requires the first two assumptions. In the simulation below, we will also explore how robust the independent t-test is to this violation, and compare how well the Welch test behaves under the same conditions.
Post-hoc Power Analysis

As an aside, while the following simulation explores empirical Type I error and power rates, the same logic can be used to evaluate so-called post-hoc power analyses in empirical research. In this specific application only the power results would be obtained, given some fixed sample size (the sample size used in the original research), population effect size(s), assumed random variable distributions, and \(\alpha\) cut-off, to obtain the empirical estimate of the power, \(1-\beta\). This is primarily useful in evaluating whether or not a published statistical test had a fair chance of rejecting an incorrect null hypothesis.
Substituting in effect size estimates from a research study, however, will result in (highly questionable) estimates of observed power. In general, observed power is of little use in applied research because it conflates population generating parameters with sub-optimal empirical estimates of those parameters, which is particularly problematic in smaller sample sizes, where the difference between the parameter and its estimate can be large.
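As a minimal illustration of this distinction (a sketch, not part of the original analysis), base R's power.t.test() computes analytic power from fixed population inputs; substituting a noisy sample estimate of the effect size into the same call is what produces the questionable observed power figure. The specific numbers below are hypothetical:

# Power given *population* inputs: n per group, true effect, true SD
power.t.test(n = 30, delta = 0.5, sd = 1, sig.level = .05)$power
# "Observed" power: same call, but delta swapped for a (hypothetical)
# noisy sample estimate such as 0.31 -- the questionable practice noted above
power.t.test(n = 30, delta = 0.31, sd = 1, sig.level = .05)$power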
Step 1: Define conditions

First, define the condition combinations that should be investigated. In this case we know that the total sample size, the ratio of the sample sizes between the groups, and the respective group standard deviations will influence the detection rates of the independent t-test. Here we define four sample sizes, three group-size ratios, three standard-deviation ratios, and two mean differences, and investigate their effects when completely crossed.
library(SimDesign)
# SimFunctions()
Design <- createDesign(sample_size = c(30, 60, 90, 120),
                       group_size_ratio = c(1/2, 1, 2),
                       standard_deviation_ratio = c(1, 4, 8),
                       mean_difference = c(0, 0.5))
dim(Design)
# [1] 72  4
head(Design, 3)
# A tibble: 3 × 4
sample_size group_size_ratio standard_deviation_ratio mean_difference
<dbl> <dbl> <dbl> <dbl>
1 30 0.5 1 0
2 60 0.5 1 0
3 90 0.5 1 0
tail(Design, 3)
# A tibble: 3 × 4
sample_size group_size_ratio standard_deviation_ratio mean_difference
<dbl> <dbl> <dbl> <dbl>
1 60 2 8 0.5
2 90 2 8 0.5
3 120 2 8 0.5
Each row in Design represents a unique condition to be studied in the simulation. In this case, the first condition to be studied comes from row 1, where the total sample size is \(N=30\) and the ratio of the sample sizes differs between the groups: the second group is half (0.5) the size of the first group, therefore \(N_1 = 20\) and \(N_2 = 10\). The standard deviation ratio is 1 (i.e., the standard deviations are exactly equal), and there is no mean difference between the groups. Analyzing rows with no mean difference will give us an idea of how the Type I error rates behave.

The last condition, on row 72, indicates that the second group has a sample size twice as large as the first, hence \(N_1 = 40\) and \(N_2 = 80\). As well, group 2 has a standard deviation 8 times larger than the first group, and there is a mean difference of 0.5 (i.e., a power condition). Truth be told, however, the interpretations could be switched around, as it really depends on how the Generate() function utilizes these inputs (the interpretation could have been that group 1 was smaller when group_size_ratio == .5 rather than larger).
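For instance, the group sizes implied by rows 1 and 72 can be verified with a couple of lines of arithmetic mirroring the Generate() definition below:

# Row 1: N = 30, group_size_ratio = 1/2 (group 2 is half the size of group 1)
N <- 30; ratio <- 1/2
N2 <- N / (1/ratio + 1); N1 <- N - N2
c(N1 = N1, N2 = N2)   # N1 = 20, N2 = 10

# Row 72: N = 120, group_size_ratio = 2 (group 2 is twice the size of group 1)
N <- 120; ratio <- 2
N1 <- N / (ratio + 1); N2 <- N - N1
c(N1 = N1, N2 = N2)   # N1 = 40, N2 = 80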
Begin by defining a data generation function that returns a data.frame with two columns: the dependent variable, and a grouping variable (a factor or otherwise).
Generate <- function(condition, fixed_objects = NULL){
    Attach(condition) # makes all elements in condition available
    # solve for the two group sizes given the total sample_size, where
    # group_size_ratio is interpreted as N2/N1
    if(group_size_ratio < 1){
        N2 <- sample_size / (1/group_size_ratio + 1)
        N1 <- sample_size - N2
    } else {
        N1 <- sample_size / (group_size_ratio + 1)
        N2 <- sample_size - N1
    }
    # group 1 is standard normal; group 2 shifts the mean and scales the SD
    group1 <- rnorm(N1)
    group2 <- rnorm(N2, mean=mean_difference, sd=standard_deviation_ratio)
    dat <- data.frame(group = c(rep('g1', N1), rep('g2', N2)), DV = c(group1, group2))
    dat
}
The Generate() function defined above first extracts the elements from condition for convenience, generates data for the two groups using random normal draws, and returns a data.frame object containing the dependent variable and the grouping information.
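Because Generate() is an ordinary R function it can be spot-checked in isolation before committing to the full simulation. A quick sketch, assuming Design and SimDesign's Attach() are available in the session:

set.seed(1)   # arbitrary seed, for reproducibility of the check only
dat <- Generate(Design[1, ])   # row 1: N = 30, group_size_ratio = 1/2
table(dat$group)               # should report g1 = 20 and g2 = 10
head(dat, 3)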
NOTE: if you are using RStudio with diagnostic flags enabled then use of Attach() will result in false-positive 'missing' variable warnings, because the IDE does not track the attached variable names (hence, you'll see yellow caution flags). To fix these false positives without disabling the (potentially useful) diagnostics, simply copy and paste the following output somewhere into your R script to suppress the messages for these specific variables only:
Attach(Design, RStudio_flags = TRUE)
# !diagnostics suppress=sample_size,group_size_ratio,standard_deviation_ratio,mean_difference
Continuing with the example, the Analyse() definition below performs a Welch and an independent \(t\)-test on the data generated from Generate(), and extracts and returns the respective \(p\)-values for each statistical analysis.
Analyse <- function(condition, dat, fixed_objects = NULL){
    welch <- t.test(DV ~ group, dat)
    ind <- t.test(DV ~ group, dat, var.equal=TRUE)
    ret <- c(welch=welch$p.value, independent=ind$p.value)
    ret
}
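The same kind of stand-alone check works for Analyse(); a single generate-analyse pass should return a named vector containing the two \(p\)-values (a sketch using the first design row):

dat <- Generate(Design[1, ])
Analyse(Design[1, ], dat)   # named p-values for the Welch and independent tests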
Finally, after the number of replications defined in replications has been completed, a matrix (or list, if the output of Analyse() were a list instead of a vector) containing the overall results should be summarised into suitable meta-descriptions of the behaviour. This is accomplished with the Summarise() function.
Summarise <- function(condition, results, fixed_objects = NULL){
    ret <- EDR(results, alpha = .05)
    ret
}
In the above function, the results input is a matrix of size replications by 2 containing the respective \(p\)-values returned from Analyse() across the desired number of replications. It is here where the empirical detection rates can be determined, and the convenience function EDR() is available for just this purpose. Each column in the supplied matrix of \(p\)-values is checked to determine whether it is less than \(\alpha = .05\), and the proportion of times this occurs is returned as a vector of length 2 with the associated column names. Note that names are preserved by EDR(), therefore it is clear which element represents the Welch rate and which the independent t-test rate.
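To see EDR() behave as described outside of the simulation, note that \(p\)-values drawn from a uniform distribution (the behaviour of a well-calibrated test under the null) should be flagged at roughly the nominal rate. A toy sketch:

set.seed(2)
p <- matrix(runif(2000), ncol = 2,
            dimnames = list(NULL, c('welch', 'independent')))
EDR(p, alpha = .05)   # both named rates should fall close to .05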
In this case a few extra arguments are passed to runSimulation(). To help speed things up, parallel processing is enabled by passing parallel = TRUE, and messages printed to the console are disabled by passing verbose = FALSE.
res <- runSimulation(design = Design, replications = 1000,
                     generate = Generate, analyse = Analyse, summarise = Summarise,
                     verbose = FALSE, parallel = TRUE)
Analyse the results
Usually it is a good idea to split the analyses into Type I error and power rate data, and to convert design conditions into factors if/where necessary. More specifically, power rates should only be interpreted when the Type I error rates are well behaved (liberal detection rates naturally result in more powerful tests, while conservative detection rates induce lower power).
res
# A tibble: 72 × 10
sample_size group_size_ratio standard_deviation_ratio mean_difference welch
<dbl> <dbl> <dbl> <dbl> <dbl>
1 30 0.5 1 0 0.042
2 60 0.5 1 0 0.051
3 90 0.5 1 0 0.05
4 120 0.5 1 0 0.051
5 30 1 1 0 0.053
6 60 1 1 0 0.05
7 90 1 1 0 0.051
8 120 1 1 0 0.052
9 30 2 1 0 0.053
10 60 2 1 0 0.051
# ℹ 62 more rows
# ℹ 5 more variables: independent <dbl>, REPLICATIONS <dbl>, SIM_TIME <chr>,
# SEED <int>, COMPLETED <chr>
(TypeI <- subset(res, mean_difference == 0, select=sample_size:independent))
# A tibble: 36 × 6
sample_size group_size_ratio standard_deviation_ratio mean_difference welch
<dbl> <dbl> <dbl> <dbl> <dbl>
1 30 0.5 1 0 0.042
2 60 0.5 1 0 0.051
3 90 0.5 1 0 0.05
4 120 0.5 1 0 0.051
5 30 1 1 0 0.053
6 60 1 1 0 0.05
7 90 1 1 0 0.051
8 120 1 1 0 0.052
9 30 2 1 0 0.053
10 60 2 1 0 0.051
# ℹ 26 more rows
# ℹ 1 more variable: independent <dbl>
(Power <- subset(res, mean_difference != 0, select=sample_size:independent))
# A tibble: 36 × 6
sample_size group_size_ratio standard_deviation_ratio mean_difference welch
<dbl> <dbl> <dbl> <dbl> <dbl>
1 30 0.5 1 0.5 0.21
2 60 0.5 1 0.5 0.411
3 90 0.5 1 0.5 0.577
4 120 0.5 1 0.5 0.726
5 30 1 1 0.5 0.274
6 60 1 1 0.5 0.47
7 90 1 1 0.5 0.649
8 120 1 1 0.5 0.797
9 30 2 1 0.5 0.239
10 60 2 1 0.5 0.443
# ℹ 26 more rows
# ℹ 1 more variable: independent <dbl>
Let's start off by looking at some descriptive tables, and performing simple ANOVAs with effect sizes to determine where the differences generally occurred.
library(reshape2)
mlt <- melt(TypeI, id.vars = colnames(TypeI)[1:4])
out <- dcast(mlt, group_size_ratio + standard_deviation_ratio + sample_size ~ variable)
colnames(out) <- c('Group Ratio', 'SD Ratio', 'N', 'Welch', 'Independent')
emph <- suppressWarnings(which(out > .075 | out < .025, arr.ind = TRUE))
pander::pander(out, round=2, emphasize.strong.cells = emph)
Group Ratio   SD Ratio     N   Welch   Independent
        0.5          1    30    0.04          0.04
        0.5          1    60    0.05          0.05
        0.5          1    90    0.05          0.05
        0.5          1   120    0.05          0.05
        0.5          4    30    0.06          0.16
        0.5          4    60    0.07          0.18
        0.5          4    90    0.05          0.16
        0.5          4   120    0.06          0.16
        0.5          8    30    0.06          0.17
        0.5          8    60    0.05          0.14
        0.5          8    90    0.06          0.18
        0.5          8   120    0.05          0.17
          1          1    30    0.05          0.05
          1          1    60    0.05          0.05
          1          1    90    0.05          0.05
          1          1   120    0.05          0.05
          1          4    30    0.05          0.05
          1          4    60    0.05          0.05
          1          4    90    0.05          0.05
          1          4   120    0.04          0.04
          1          8    30    0.05          0.05
          1          8    60    0.05          0.06
          1          8    90    0.05          0.05
          1          8   120    0.04          0.04
          2          1    30    0.05          0.05
          2          1    60    0.05          0.04
          2          1    90    0.05          0.05
          2          1   120    0.04          0.04
          2          4    30    0.05          0.01
          2          4    60    0.05          0.01
          2          4    90    0.06          0.01
          2          4   120    0.05          0.01
          2          8    30    0.05          0.01
          2          8    60    0.05          0.00
          2          8    90    0.04          0.01
          2          8   120    0.05          0.01
out2 <- dcast(mlt, group_size_ratio + standard_deviation_ratio ~ variable, fun.aggregate = mean)
colnames(out2) <- c('Group Ratio', 'SD Ratio', 'Welch', 'Independent')
emph2 <- suppressWarnings(which(out2 > .075 | out2 < .025, arr.ind = TRUE))
pander::pander(out2, round=2, emphasize.strong.cells = emph2)
Group Ratio   SD Ratio   Welch   Independent
        0.5          1    0.05          0.05
        0.5          4    0.06          0.17
        0.5          8    0.05          0.16
          1          1    0.05          0.05
          1          4    0.04          0.05
          1          8    0.05          0.05
          2          1    0.05          0.04
          2          4    0.05          0.01
          2          8    0.05          0.01
The descriptive results seem to be very favorable for the Welch test, in that its rates are all close to the nominal detection rate. However, the independent \(t\)-test has combinations where it is either liberal or conservative. Though we could simply inspect the table to see why/where this occurred, sometimes it is easier to use ANOVAs to help pinpoint the sources of variation.
The one catch with ANOVAs is that there must be within-group variability, therefore something in our design must be marginalized. In this case, we will marginalize sample size, because Type I error rates really shouldn't be influenced by \(N\). Inspecting the previous table also indicates that this assumption is probably a good one.
SimAnova(independent ~ sample_size, TypeI)
                SS df    MS     F     p sig eta.sq eta.sq.part
sample_size  0.097  1 0.097 0.076 0.784   .  0.002       0.002
Residuals   43.265 34 1.272    NA    NA     0.998          NA
SimAnova(independent ~ group_size_ratio * standard_deviation_ratio, TypeI)
                                              SS df     MS      F     p sig eta.sq eta.sq.part
group_size_ratio                          28.673  1 28.673 254.12 0.000 ***  0.661       0.888
standard_deviation_ratio                   0.331  1  0.331   2.94 0.096   .  0.008       0.084
group_size_ratio:standard_deviation_ratio 10.746  1 10.746  95.24 0.000 ***  0.248       0.749
Residuals                                  3.611 32  0.113     NA    NA      0.083          NA
After examining the effect sizes for the independent t-test, there appears to be an interaction effect between group_size_ratio and standard_deviation_ratio. The analogous ANOVA for the Welch test (below) shows no such effects. To help visualize the interaction, we then create box-plots while continuing to marginalize over the sample size (otherwise, we would have to use point estimates for scatter plots).
SimAnova(welch ~ group_size_ratio * standard_deviation_ratio, TypeI)
                                             SS df    MS     F     p sig eta.sq eta.sq.part
group_size_ratio                          0.034  1 0.034 2.328 0.137   .  0.068       0.068
standard_deviation_ratio                  0.000  1 0.000 0.006 0.940   .  0.000       0.000
group_size_ratio:standard_deviation_ratio 0.001  1 0.001 0.094 0.761   .  0.003       0.003
Residuals                                 0.467 32 0.015    NA    NA     0.929          NA
library(ggplot2)
# Welch
gg <- ggplot(TypeI, aes(factor(group_size_ratio), welch, fill = factor(standard_deviation_ratio)))
gg + geom_boxplot() + facet_wrap(~standard_deviation_ratio) + ylim(c(0,0.2)) +
geom_hline(yintercept = .075, colour = 'red') + geom_hline(yintercept = .025, colour = 'red')
# Independent
gg <- ggplot(TypeI, aes(factor(group_size_ratio), independent, fill = factor(standard_deviation_ratio)))
gg + geom_boxplot() + facet_wrap(~standard_deviation_ratio) + ylim(c(0,0.2)) +
geom_hline(yintercept = .075, colour = 'red') + geom_hline(yintercept = .025, colour = 'red')
What is immediately obvious from these figures is that the Welch test performs well across all conditions at controlling the Type I error rates, while the independent \(t\)-test is heavily influenced by particular combinations. When the variances were equal (left facet), the independent \(t\)-test performed well, which should not be surprising because all of its assumptions were met. Likewise, when the sample sizes were equal the test performed well across the board.
However, when the second group contained both a larger variance AND a larger sample size than the first group, the detection rates became more conservative. The converse was true when the larger variance was paired with the smaller sample size, where the detection rates became progressively more liberal as the ratio of the sample sizes and the ratio of the standard deviations increased.
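A quick stand-alone check of one of the liberal cells (the smaller second group paired with an 8-times-larger standard deviation) reproduces this behaviour without any of the simulation machinery; the seed and replication count here are arbitrary:

set.seed(3)
p <- replicate(2000, {
    g1 <- rnorm(20)            # larger group, SD = 1
    g2 <- rnorm(10, sd = 8)    # smaller group, SD = 8
    t.test(g1, g2, var.equal = TRUE)$p.value
})
mean(p < .05)   # lands well above the nominal .05, as in the tables above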
An alternative approach

The following is the same simulation design as above; however, the approach is slightly different. Because the design input object is a data.frame, it is technically possible to pass lists as factors to createDesign() as well, thereby creating a fully-crossed simulation design from the list inputs. This may be preferable to some users if, for example, you would rather explicitly include the grouping distributions in design rather than using scalars such as 1/2, 1, and 2.
Consider the following design definition:
library(SimDesign)
# SimFunctions()
Design <- createDesign(sample_size = c(30, 60, 90, 120),
                       group_sizes = list(c(1/3, 2/3),
                                          c(1/2, 1/2),
                                          c(2/3, 1/3)),
                       standard_deviations = list(c(1, 1),
                                                  c(1, 4),
                                                  c(1, 8)),
                       means = list(c(0, 0), c(0, 0.5)))
Design
# A tibble: 72 × 4
sample_size group_sizes standard_deviations means
<dbl> <list> <list> <list>
1 30 <dbl [2]> <dbl [2]> <dbl [2]>
2 60 <dbl [2]> <dbl [2]> <dbl [2]>
3 90 <dbl [2]> <dbl [2]> <dbl [2]>
4 120 <dbl [2]> <dbl [2]> <dbl [2]>
5 30 <dbl [2]> <dbl [2]> <dbl [2]>
6 60 <dbl [2]> <dbl [2]> <dbl [2]>
7 90 <dbl [2]> <dbl [2]> <dbl [2]>
8 120 <dbl [2]> <dbl [2]> <dbl [2]>
9 30 <dbl [2]> <dbl [2]> <dbl [2]>
10 60 <dbl [2]> <dbl [2]> <dbl [2]>
# ℹ 62 more rows
Notice that there are still 72 conditions to be investigated; however, now each element in Design$group_sizes, Design$standard_deviations, and Design$means is a vector rather than a scalar. E.g.,
Design[1, ]
# A tibble: 1 × 4
sample_size group_sizes standard_deviations means
<dbl> <list> <list> <list>
1 30 <dbl [2]> <dbl [2]> <dbl [2]>
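The vectors stored in these list columns can be pulled out with standard list indexing, which is exactly what the modified Generate() below relies on. E.g.:

Design$group_sizes[[1]]                      # c(1/3, 2/3) for the first row
with(Design[1, ], standard_deviations[[1]])  # c(1, 1)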
Hence, all elements are explicitly defined for each group rather than indexed and manipulated as in the previous example. The remainder of the simulation code is as follows (note that in this case only the Generate() function needed to be modified):
Generate <- function(condition, fixed_objects = NULL){
    N <- with(condition, sample_size)
    Ns <- N * with(condition, group_sizes[[1]])   # convert proportions to group ns
    N1 <- Ns[1]; N2 <- Ns[2]
    sd <- with(condition, standard_deviations[[1]])
    m <- with(condition, means[[1]])
    group1 <- rnorm(N1, mean=m[1], sd=sd[1])
    group2 <- rnorm(N2, mean=m[2], sd=sd[2])
    dat <- data.frame(group = c(rep('g1', N1), rep('g2', N2)), DV = c(group1, group2))
    dat
}
Analyse <- function(condition, dat, fixed_objects = NULL){
    welch <- t.test(DV ~ group, dat)
    ind <- t.test(DV ~ group, dat, var.equal=TRUE)
    ret <- c(welch=welch$p.value, independent=ind$p.value)
    ret
}

Summarise <- function(condition, results, fixed_objects = NULL){
    ret <- EDR(results, alpha = .05)
    ret
}
#---------------------------------------------------
res <- runSimulation(design = Design, replications = 1000, verbose = FALSE, parallel=TRUE,
                     generate = Generate, analyse = Analyse, summarise = Summarise)
res
# A tibble: 72 × 10
sample_size group_sizes standard_deviations means welch independent
<dbl> <list> <list> <list> <dbl> <dbl>
1 30 <dbl [2]> <dbl [2]> <dbl [2]> 0.048 0.047
2 60 <dbl [2]> <dbl [2]> <dbl [2]> 0.052 0.051
3 90 <dbl [2]> <dbl [2]> <dbl [2]> 0.041 0.044
4 120 <dbl [2]> <dbl [2]> <dbl [2]> 0.047 0.051
5 30 <dbl [2]> <dbl [2]> <dbl [2]> 0.046 0.046
6 60 <dbl [2]> <dbl [2]> <dbl [2]> 0.041 0.041
7 90 <dbl [2]> <dbl [2]> <dbl [2]> 0.049 0.049
8 120 <dbl [2]> <dbl [2]> <dbl [2]> 0.067 0.067
9 30 <dbl [2]> <dbl [2]> <dbl [2]> 0.052 0.061
10 60 <dbl [2]> <dbl [2]> <dbl [2]> 0.056 0.052
# ℹ 62 more rows
# ℹ 4 more variables: REPLICATIONS <dbl>, SIM_TIME <chr>, SEED <int>,
# COMPLETED <chr>
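One practical consequence of this approach is that subsetting by condition now operates on the list columns, so a logical index must be built from the stored vectors. A sketch for pulling out the Type I error rows (those whose mean vector is c(0, 0)):

TypeI2 <- res[sapply(res$means, function(m) all(m == 0)), ]
TypeI2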