Background
When working with categorical data, it is common to conduct tests such as Tukey's Honest Significant Difference (HSD) test. This is a powerful test that compares all pairs of groups, but the number of comparisons grows quickly as the number of groups increases: k groups give k(k-1)/2 pairs. The following is the method I use to visualize differences in categorical data.
Disclaimer: I am not claiming to have invented something new; this is just how I like to visualize this analysis, and I am sure there are similar methods out there.
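To make the growth in pairwise comparisons concrete, here is a minimal sketch that counts the pairs for a range of group counts. The variable names `k` and `n_pairs` are just illustrative choices, and the range 2 to 10 is arbitrary.

```r
# Number of groups to consider (arbitrary illustrative range)
k <- 2:10

# Each unordered pair of groups is one comparison: choose(k, 2) = k * (k - 1) / 2
n_pairs <- choose(k, 2)

# Show group count next to the number of pairwise comparisons
data.frame(groups = k, comparisons = n_pairs)
```

With 10 groups there are already 45 pairs to compare, which is why a plain table of pairwise results becomes hard to read.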
T-Test
Say you have two samples and you want to determine whether they come from the same population, i.e. whether they are "different". You could just compare their means, and if the means differ you are good to go... right? Well, what if they are pretty close? How close is close enough?
This is what the t-test is for: it tests whether the means of two samples are significantly different from one another.
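library(data.table) # as.data.table(), melt()
library(ggplot2)    # ggplot()
library(plotly)     # ggplotly()

x <- rnorm(200, mean=18, sd=2) # generate normally distributed data
y <- rnorm(200, mean=22, sd=2) # generate a second sample with a different mean
df <- data.frame(x=x, y=y) # put both samples in a data frame
# reshape to long format: one row per observation, with columns "variable" and "value"
df <- as.data.frame(melt(as.data.table(df), measure.vars=c("x", "y")))
# Plot overlapping histograms of the two samples
p <- ggplot(df, aes(x=value, fill=variable, color=variable)) +
  geom_histogram(binwidth=1, alpha=0.5, position="identity") +
  ggtitle("Comparing means") + xlab("Value") + ylab("Frequency")
ggplotly(p, width=640, height=640)
# T-test (R's default is Welch's two-sample t-test, which does not assume equal variances)
t.test(x, y)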
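To show what `t.test(x, y)` computes, here is a rough sketch that rebuilds the Welch statistic, its degrees of freedom, and the two-sided p-value by hand, then compares them with R's result. The seed and the names `vx`, `vy`, `t_manual`, `df_manual` and `p_manual` are my own choices for illustration, and the sketch assumes the default settings of `t.test` (two-sided, unequal variances).

```r
set.seed(42) # arbitrary seed, only so the sketch is reproducible
x <- rnorm(200, mean = 18, sd = 2)
y <- rnorm(200, mean = 22, sd = 2)

# Squared standard error of each sample mean
vx <- var(x) / length(x)
vy <- var(y) / length(y)

# Welch t statistic: difference in means divided by its standard error
t_manual <- (mean(x) - mean(y)) / sqrt(vx + vy)

# Welch-Satterthwaite approximation to the degrees of freedom
df_manual <- (vx + vy)^2 / (vx^2 / (length(x) - 1) + vy^2 / (length(y) - 1))

# Two-sided p-value from the t distribution
p_manual <- 2 * pt(-abs(t_manual), df = df_manual)

# Compare with the built-in test
res <- t.test(x, y)
all.equal(unname(res$statistic), t_manual)
all.equal(unname(res$parameter), df_manual)
all.equal(res$p.value, p_manual)
```

The statistic is simply the gap between the two means measured in units of its own standard error, which is how the test answers "how close is close enough?".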