Hypothesis Testing

At the heart of research lies a question. For example, consider the following scenario: you just went for a run in the park, and you feel great. Naturally, you might ask yourself "does exercise make people happy?". If you are asking a question that you don't know the answer to, research is necessary to resolve it. There are many forms that this research can take, from a literature review to performing an experiment. A technique known as statistical hypothesis testing is often used in psychology to determine a likely answer to a research question.

Formulating the hypotheses

With hypothesis testing, the research question is formulated as two competing hypotheses: the null hypothesis and the alternative hypothesis. The null hypothesis is the default position that the effect you are looking for does not exist, and the alternative hypothesis is that your prediction is correct. The goal of hypothesis testing is to collect evidence and reject the null hypothesis if it appears unlikely to be true. In other words, if we reject the null hypothesis there is some experimental support for the alternative hypothesis (although it is important to keep in mind that we have not proved the alternative hypothesis is true).

Here are the hypotheses for our example:

Null hypothesis
Physical exercise does not increase mood.
Alternative hypothesis
Physical exercise increases mood.

Number of tails

Hypotheses can have a direction. In particular, a directional hypothesis not only states that an effect exists, but also states the direction of the effect. In the terminology of hypothesis testing, this is known as the number of tails of the hypothesis:

One tail
The hypothesis has an implied direction. The alternative hypothesis above is one-tailed, since it refers to an increase in mood.
Two tails
The hypothesis does not imply a direction. A two-tailed version of the alternative hypothesis above is "exercise has an impact on mood". In this case, we suspect there is a relationship between exercise and happiness, but we're not sure if the impact will be positive or negative.
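The number of tails matters in practice because it changes the p-value (defined under "Statistical significance" below). As a minimal sketch, assume the test statistic is a z-score that follows a standard normal distribution when the null hypothesis is true; the function name p_values and the value 1.8 are invented for illustration.

```python
from statistics import NormalDist

def p_values(z):
    """Return (one-tailed, two-tailed) p-values for a z statistic."""
    std_normal = NormalDist()  # mean 0, standard deviation 1
    # One tail: only a large positive statistic counts as evidence.
    one_tailed = 1 - std_normal.cdf(z)
    # Two tails: a large statistic in either direction counts as evidence.
    two_tailed = 2 * (1 - std_normal.cdf(abs(z)))
    return one_tailed, two_tailed

one, two = p_values(1.8)  # hypothetical test statistic
print(f"one-tailed p = {one:.3f}, two-tailed p = {two:.3f}")
```

For a positive statistic, the two-tailed p-value is exactly twice the one-tailed value, so the same data can be significant at the 0.05 level under a one-tailed hypothesis and not under a two-tailed one.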

Statistical significance

Due to naturally occurring variability, two separate measurements (even of the same phenomenon) will almost always give different results. For example, assume I measure my happiness after a run on Monday, and I measure it again after a run on Wednesday. It would not be surprising if the results were different each time, since there are many factors that impact mood. Therefore, the goal of hypothesis testing is not to see if there is any difference between sets of measurements (there almost always will be), but rather to see if the differences are unlikely to be due to random variation. If so, we can say that our result is statistically significant. The general procedure is as follows:

  1. Compute a test statistic (e.g. a t-statistic). The test statistic is a single value that is sensitive to the difference between the null hypothesis and the alternative hypothesis.
  2. Use the sampling distribution of the test statistic (e.g. the t-distribution) to calculate a p-value. The p-value is the probability of obtaining a test statistic at least as extreme as the one observed, assuming the null hypothesis is true.
  3. Reject the null hypothesis if the p-value is less than a predetermined threshold, which is known as the significance level. For example, if we use a significance level of 0.01 and obtain a p-value of 0.008, we reject the null hypothesis and say that the result is statistically significant.
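The three steps above can be sketched in a few lines of code. Note that this is not the t-test mentioned in the list: to stay within the Python standard library, the sketch swaps in an exact permutation test, which builds the sampling distribution by reshuffling the data instead of using the t-distribution. The mood scores, the function name permutation_p_value and the significance level of 0.05 are all assumptions made for illustration.

```python
from itertools import combinations
from statistics import mean

def permutation_p_value(treatment, control):
    """One-tailed exact permutation test for a difference in means."""
    # Step 1: the test statistic is the observed difference in group means.
    observed = mean(treatment) - mean(control)

    # Step 2: if the null hypothesis is true, group labels are arbitrary,
    # so every way of splitting the pooled scores is equally likely.
    pooled = list(treatment) + list(control)
    n_treat = len(treatment)
    n_ctrl = len(pooled) - n_treat
    total = sum(pooled)
    n_splits = 0
    n_extreme = 0
    for idx in combinations(range(len(pooled)), n_treat):
        treat_sum = sum(pooled[i] for i in idx)
        diff = treat_sum / n_treat - (total - treat_sum) / n_ctrl
        n_splits += 1
        if diff >= observed - 1e-12:  # at least as extreme as observed
            n_extreme += 1
    return observed, n_extreme / n_splits

# Hypothetical mood scores on a 0-10 scale, invented for illustration.
exercise = [7, 8, 6, 9, 8, 7]
no_exercise = [5, 6, 7, 5, 6, 4]

alpha = 0.05  # significance level, chosen before looking at the data
stat, p = permutation_p_value(exercise, no_exercise)

# Step 3: compare the p-value with the significance level.
print(f"difference in means = {stat:.2f}, p = {p:.4f}")
print("reject the null hypothesis" if p < alpha else "fail to reject the null hypothesis")
```

Enumerating every split is only feasible for small samples like these; for larger data sets a parametric test or a random sample of permutations would normally be used instead.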

Errors

The outcome of a hypothesis test is a decision: either reject the null hypothesis or fail to reject it. However, no matter how careful you are with your experimental design, there is always a non-zero probability that you will come to the incorrect conclusion. There are two possible errors, depending on which hypothesis is actually true:

Type I error
Falsely rejecting the null hypothesis. In other words, the effect you are looking for does not exist in reality, but the conclusion of your study is that the effect is real. This is a false scientific claim. For the example above, the type I error would be claiming that physical exercise increases mood when it actually doesn't.
Type II error
Failing to reject a false null hypothesis (often described as falsely accepting the null hypothesis). In other words, the effect you are looking for is real, but the outcome of your research is that there is no effect. This is a missed scientific discovery. For the example above, the type II error would be concluding that exercise does not increase mood, even though it does.

What type of error is worse? Obviously, the impact of an error depends on many factors. However, generally speaking a type I error is worse since the trial is more likely to be published and instigate change. For example, if you are testing the efficacy of a new psychoactive drug, a type I error may result in the drug being released to the public. This is potentially dangerous, as you are exposing people to the risk of side effects for a drug that doesn't work.

  • To decrease the probability of a type I error you can lower your significance level (e.g. from 0.05 to 0.01).
  • To decrease the probability of a type II error you can increase the power of your analysis by, for example, increasing your sample size.
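The trade-off between the two error types can be made concrete with a small calculation. The sketch below assumes the simplest possible setting, a one-sided, one-sample z-test with known standard deviation, because its power has a closed form. The function name power_one_sided_z and the effect size of 0.3 standard deviations are assumptions for illustration; in real research the true effect size is unknown.

```python
from math import sqrt
from statistics import NormalDist

def power_one_sided_z(effect_size, n, alpha):
    """Power of a one-sided, one-sample z-test.

    effect_size is the true mean shift in standard-deviation units
    (an assumed value), n is the sample size, and alpha is the
    significance level, i.e. the type I error rate.
    """
    std_normal = NormalDist()
    critical_z = std_normal.inv_cdf(1 - alpha)  # rejection threshold
    # Under the alternative, the z statistic is centred on effect_size * sqrt(n).
    return 1 - std_normal.cdf(critical_z - effect_size * sqrt(n))

for alpha in (0.05, 0.01):
    for n in (20, 50, 100):
        power = power_one_sided_z(effect_size=0.3, n=n, alpha=alpha)
        print(f"alpha={alpha:.2f}  n={n:3d}  power={power:.2f}  "
              f"type II rate={1 - power:.2f}")
```

Under these assumptions, lowering the significance level reduces power for a fixed sample size, so a stricter threshold for type I errors generally needs a larger sample to keep the type II error rate down.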

Experiment design

With experimental research, the general strategy is to manipulate one aspect of the trial (the independent variable) and measure the impact on another aspect of the trial (the dependent variable). There are two primary methods of data collection:

Independent groups
The data are collected from two different groups of people. For example, assume we are testing our exercise hypothesis. We could compare happiness levels between the following two groups:
  1. A control group that does not do regular exercise.
  2. A treatment group that exercises regularly.
Same subjects
The data consist of multiple measurements from the same group of participants. For example, we could compare happiness levels between the following data sets:
  1. Happiness levels before starting an exercise regime.
  2. Happiness levels for the same group of people after several months of training.

The distinction between independent groups and same subjects designs is important since different statistical tests are used for each. In general, same subjects designs have more statistical power, since differences between individual participants no longer contribute to the variation in the experiment. Note that a randomized controlled trial (RCT), the gold standard of clinical trials, combines both design types by having pre and post measures for both a control and a treatment group.
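To see why the design affects power, the following sketch computes a t-statistic for the same numbers in two ways: once as if the two columns came from independent groups (using Welch's form of the statistic) and once as repeated measures on the same people. The scores and the function names welch_t and paired_t are invented for illustration, and a real study would only ever use the calculation that matches its design.

```python
from math import sqrt
from statistics import mean, stdev, variance

def welch_t(a, b):
    """t statistic treating a and b as independent groups (Welch's form)."""
    standard_error = sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / standard_error

def paired_t(before, after):
    """t statistic treating the scores as repeated measures on the same people."""
    diffs = [y - x for x, y in zip(before, after)]
    return mean(diffs) / (stdev(diffs) / sqrt(len(diffs)))

# Hypothetical happiness scores for six participants, invented for illustration.
before = [4, 6, 5, 8, 3, 7]
after = [5, 7, 7, 9, 4, 8]

print(f"independent groups t = {welch_t(after, before):.2f}")
print(f"same subjects t      = {paired_t(before, after):.2f}")
```

In this example the participants differ a lot from one another but each improves by a similar amount, so the same subjects statistic comes out several times larger: working with per-person differences removes the between-person variation from the comparison.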

Parametric vs non-parametric models

The statistical tests on the following pages can be categorized as either parametric or non-parametric. Parametric tests make certain assumptions about the nature of the underlying data, while non-parametric tests are more general. Parametric tests tend to have more statistical power than their non-parametric counterparts, so should be used when applicable. However, if their assumptions are violated, they may give incorrect or misleading results.

This choice between parametric and non-parametric models is based on the intrinsic nature of the data, and is therefore outside of the control of the experimenter. Therefore, you should always examine your data and conduct tests to verify the assumptions where appropriate.

The most common assumption for parametric tests is the assumption of normality. Typically, the assumption of normality applies to the sampling distribution, rather than the underlying data. This is good news, since it is usually satisfied for sufficiently large data sets (e.g. N > 30) due to the central limit theorem. In general, normality of the underlying data is sufficient, but not necessary, for the use of parametric tests.
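The central limit theorem can be illustrated with a short simulation. The sketch below draws strongly skewed raw data and then looks at the distribution of sample means for samples of size 30. The choice of an exponential distribution, the seed, the sample counts and the helper name skewness are all assumptions made for illustration; skewness is used only as a rough indicator of asymmetry, not as a formal normality test.

```python
import random
from statistics import mean

def skewness(values):
    """Sample skewness: close to 0 for symmetric (e.g. normal) data."""
    m = mean(values)
    n = len(values)
    m2 = sum((v - m) ** 2 for v in values) / n
    m3 = sum((v - m) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

random.seed(0)  # fixed seed so the sketch is repeatable

# Strongly right-skewed raw data (exponential distribution).
raw = [random.expovariate(1.0) for _ in range(20000)]

# Sampling distribution of the mean for samples of size N = 30.
sample_means = [
    mean([random.expovariate(1.0) for _ in range(30)])
    for _ in range(5000)
]

print(f"skewness of raw data:     {skewness(raw):.2f}")
print(f"skewness of sample means: {skewness(sample_means):.2f}")
```

The sample means are much less skewed than the raw data, which is why a test that assumes a normal sampling distribution can still be reasonable for non-normal data, provided the sample is large enough.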

Aside: There are some fundamental problems with the theory and practice of "null hypothesis significance testing". In fact, it has been coming under increasing criticism in recent years. The topic is too broad to go into here, so instead I refer you to Andy Field's great discussion of the main issues in his book "Discovering Statistics Using SPSS".