Say you have two samples and you want to determine whether they come from the same population, i.e. whether they are "different". You could just compare their means, and if the means differ you are good to go... right? Well, what if they are pretty close? How close is close enough?
To test this we have the t-test, which tells us whether two samples are significantly different from one another.
library(data.table) # melt() for reshaping
library(ggplot2) # plotting
library(plotly) # interactive plots
x <- rnorm(200, mean=18, sd=2) # generate normally distributed data
y <- rnorm(200, mean=22, sd=2) # generate normal data with a different mean
df <- data.frame(x=x, y=y) # put it in a data frame
df <- data.frame(melt(as.data.table(df))) # reshape to long format (columns: variable, value)
# Plot
p <- ggplot(df, aes(x=value, fill=variable, color=variable)) +
geom_histogram(binwidth=1, alpha=0.5, position = "identity") +
ggtitle("Comparing means") + xlab("Value") + ylab("Frequency")
ggplotly(p, width=640, height=640)
# t-test
t.test(x,y)
        Welch Two Sample t-test

data:  x and y
t = -20.552, df = 397.77, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -4.501817 -3.715759
sample estimates:
mean of x mean of y 
 17.90109  22.00988 
In this example we see that the two-sample t-test confirms that the samples $x$ and $y$ are significantly different!
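If you want the numbers programmatically rather than just reading the console output, the object returned by t.test is a list with named components:
tt <- t.test(x, y) # returns an object of class "htest"
tt$p.value # the p-value
tt$conf.int # 95 percent confidence interval for the difference in means
tt$estimate # the two sample means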
This is great, but what if there are more than two samples? Then we can use Analysis of Variance (ANOVA) to determine whether at least one of the samples is different.
x <- rnorm(200, mean=18, sd=2) # generate normally distributed data
y <- rnorm(200, mean=22, sd=3) # generate normal data with a different mean
z <- rnorm(200, mean=22, sd=1) # generate data with the same mean as y
df <- data.frame(x=x, y=y, z=z) # put it in a data frame
df <- data.frame(melt(as.data.table(df))) # reshape to long format
# Plot
p <- ggplot(df, aes(x=value, fill=variable, color=variable)) +
geom_histogram(binwidth=1, alpha=0.5, position = "identity") +
ggtitle("Comparing means") + xlab("Value") + ylab("Frequency")
ggplotly(p, width=640, height=640)
# ANOVA
summary(aov(df$value ~ df$variable))
             Df Sum Sq Mean Sq F value Pr(>F)    
df$variable   2   2011  1005.7   196.5 <2e-16 ***
Residuals   597   3055     5.1                   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The result of this ANOVA shows that there is a significant difference somewhere in the data, but $y$ and $z$ were generated with the same mean! We need post-hoc analysis to determine which groups are different.
You may be tempted to use the t-test to compare each pair of groups. However, with an error rate of 0.05 per test, after just 3 comparisons the chance of at least one false positive rises to 0.143. This is a big problem: as we add more groups and need to make more comparisons, the error rate of all of the tests together increases very quickly. This is called the experiment-wise error rate (EWER). For $m$ comparisons at significance level $\alpha$:

$$ EWER = 1 - (1 - \alpha)^{m} $$
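A quick numeric check of how fast this compounds, just evaluating the formula above:
alpha <- 0.05
m <- 1:10 # number of pairwise comparisons
round(1 - (1 - alpha)^m, 3) # m = 3 already gives 0.143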
What we need is a test which maintains a specified error rate across all of the tests. We can use the Tukey Honest Significant Difference (HSD) test to determine which groups are different while maintaining a specified EWER. The following plot shows which groups differ: the comparisons are denoted on the y-axis and the difference in the means on the x-axis. The error bars on either side are determined by the Q-distribution. If the error bars cross the 0 line, there is not a significant difference; if they do not, there is.
plot(TukeyHSD(aov(df$value ~ df$variable)))
As we can see above, $x$ and $y$ are different, $x$ and $z$ are different, but $z$ and $y$ are not (as expected).
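For reference, the underlying numbers are easy to get as well; TukeyHSD returns, for each pair of groups, the difference in means, the lower and upper bounds of the interval, and an adjusted p-value:
TukeyHSD(aov(df$value ~ df$variable)) # columns: diff, lwr, upr, p adj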
Obviously we don't HAVE to use a plot to conduct the Tukey HSD, but personally I like to be able to visualize the differences; I think it helps in understanding the data. The issue with this plot can be seen below.
# Create 10 groups
x <- rnorm(200, mean=18, sd=2) # generate normally distributed data
y <- rnorm(200, mean=22, sd=3) # generate normal data with a different mean
z <- rnorm(200, mean=22, sd=1) # generate data with the same mean as y
a <- rnorm(200, mean=21, sd=3) # seven more groups with assorted means and sds
b <- rnorm(200, mean=24, sd=2)
c <- rnorm(200, mean=22, sd=5)
d <- rnorm(200, mean=23, sd=4)
e <- rnorm(200, mean=28, sd=3)
f <- rnorm(200, mean=20, sd=2)
g <- rnorm(200, mean=20, sd=2) # f and g share the same parameters
df <- data.frame(x=x, y=y, z=z, a=a, b=b, c=c, d=d, e=e, f=f, g=g) # put it in a data frame
df <- data.frame(melt(as.data.table(df))) # reshape to long format
# Plot
p <- ggplot(df, aes(x=value, fill=variable, color=variable)) +
geom_histogram(binwidth=1, alpha=0.5, position = "identity") +
ggtitle("Comparing means") + xlab("Value") + ylab("Frequency")
ggplotly(p, width=640, height=640)
# ANOVA
summary(aov(df$value ~ df$variable))
# Tukey HSD
plot(TukeyHSD(aov(df$value ~ df$variable)), las=1)
              Df Sum Sq Mean Sq F value Pr(>F)    
df$variable    9  11933  1325.8   161.7 <2e-16 ***
Residuals   1990  16312     8.2                   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The problem should be clear in the Tukey HSD plot. The number of comparisons is $\binom{N}{2}$, where $N$ is the number of groups, so as we add more groups the number of comparisons grows quickly and the data become harder to visualize.
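Just to put numbers on that growth:
choose(10, 2) # 45 pairwise comparisons for 10 groups
sapply(c(5, 10, 20), choose, k=2) # 10, 45, 190 -- growth is quadratic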
Here is my solution to the problem: the Square-Q plot.
library(dplyr) # group_by, summarise, count
library(ggrepel) # geom_label_repel

sq.q <- function(n, group, alpha=0.05) {
  # accepts a numeric vector, a grouping variable, and an alpha value;
  # draws a simplified Tukey plot
  dataframe <- data.frame(group, n) # put in a data frame
  model <- aov(n ~ as.factor(group)) # fit the ANOVA model
  # get the mean of each factor and merge into one data frame
  means <- dataframe %>% group_by(group) %>% summarise(n = mean(n))
  table <- merge(x=count(dataframe, group), y=means, by=1)
  MSE <- summary(model)[1][[1]][[3]][[2]] # get the MSE
  df <- summary(model)[1][[1]][[1]][[2]] # get the residual degrees of freedom
  N <- nrow(table) # number of factors
  n_min <- min(table$n.x) # minimum sample size of a factor
  # compute the Tukey range using the minimum sample size
  table$tukey <- qtukey(1-alpha, N, df=df) *
    sqrt(MSE/2 * (1/table$n.x + 1/n_min))
  # absolute differences between all pairs of means
  DIFF <- abs(outer(table$n.y, table$n.y, '-'))
  # using the Tukey estimate above, count the significant differences
  tab_tuk <- matrix(table$tukey, nrow=length(table$tukey),
                    ncol=length(table$tukey), byrow=TRUE)
  diff_t <- length(which((DIFF-tab_tuk) > 0))
  # calculate the actual pairwise Tukey range
  # and count the significant differences it finds
  cal_tuk <- qtukey(1-alpha, N, df=df) *
    sqrt(outer(1/table$n.x, 1/table$n.x, "+") * (MSE/2))
  diff_c <- length(which((DIFF-cal_tuk) > 0))
  title <- ""
  subtitle <- ""
  caption <- ""
  # labeling and captioning based on accuracy; if the sample sizes are close enough, all good
  if ((diff_c - diff_t)%/%2 == 0) {
    title <- sprintf("%0.f%% Family-Wise Confidence Level", 100*(1-alpha))
    subtitle <- "Result accurately reflects Tukey HSD"
    caption <- sprintf("Tukey HSD finds %1.0f total significant differences", diff_c%/%2)
  }
  else {
    title <- sprintf("At least %0.f%% Family-Wise Confidence Level", 100*(1-alpha))
    subtitle <- sprintf("Sample sizes not close enough; result misses %1.0f significant differences compared to Tukey HSD.", (diff_c - diff_t)%/%2)
    caption <- sprintf("Tukey HSD finds %1.0f total significant differences", diff_c%/%2)
  }
  # Tukey HSD plot, ordered from smallest mean to largest
  ggplot(table, aes(reorder(x=as.factor(group), n.y),
                    y=n.y, col=as.factor(group))) +
    # plot the Tukey ranges as crossbars around each mean
    geom_crossbar(aes(ymin=n.y-tukey, ymax=n.y+tukey, fill=as.factor(group)),
                  width=0.22, alpha=0.6, linetype="blank") +
    # draw a horizontal line across the plot at each group mean
    geom_hline(aes(yintercept=n.y, color=as.factor(group))) +
    # add group labels at the right end of the mean lines
    geom_label_repel(aes(N+0.5, n.y, label=group), direction="y") +
    xlab("Factors") + ylab("Means") + theme(legend.position="none") +
    labs(title=title,
         subtitle=subtitle,
         caption=caption) +
    scale_color_discrete() # difficult to read without color
}
sq.q(df$value, df$variable)
The above chart might look kind of intimidating at first; there is a lot going on, and there are a lot of colors.
The bottom line for Tukey HSD is that the absolute difference between the means needs to be larger than the range specified by the Q distribution and scaled by the Mean Squared Error (MSE) and the sample sizes of the groups being compared.
$$ | \bar{X}_i - \bar{X}_j | > Q_{\alpha,k,N-k} \sqrt{\frac{MSE}{2}\left(\frac{1}{n_i} + \frac{1}{n_j}\right)} $$

That is where this graph can run into problems: if the sample sizes are widely different, the range is different for each comparison. This test uses the minimum sample size for every comparison, which guarantees that any difference found by this test will also be found by a Tukey HSD. However, if the sample sizes are too widely different and the means are close to the edge of the range, there is potential for some differences to be missed.
$$ Q_{\alpha,k,N-k} \sqrt{\frac{MSE}{2}\left(\frac{1}{n_i} + \frac{1}{n_{min}}\right)} \geq Q_{\alpha,k,N-k} \sqrt{\frac{MSE}{2}\left(\frac{1}{n_i} + \frac{1}{n_j}\right)} $$

but if all sample sizes $n_i$ are the same, then these quantities are equal.
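A quick numeric sketch of this bound, using the MSE and degrees of freedom from the ten-group example above and some hypothetical (made-up) unequal sample sizes:
k <- 10; dof <- 1990; MSE <- 8.2; alpha <- 0.05 # from the ANOVA output above
n_i <- 200; n_j <- 150; n_min <- 50 # hypothetical unequal sample sizes
q <- qtukey(1-alpha, k, df=dof)
conservative <- q * sqrt(MSE/2 * (1/n_i + 1/n_min)) # range using the minimum sample size
exact <- q * sqrt(MSE/2 * (1/n_i + 1/n_j)) # actual pairwise Tukey HSD range
conservative >= exact # TRUE: the conservative range is never narrower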
This different approach to visualizing the Tukey HSD, while not quite as precise, provides valuable insight into categorical data, enables conclusions to be drawn quickly, and identifies areas which require further analysis.