April 16, 2019

Routine Normality Testing in R

Assumption of Normality

The assumption of normality is a prerequisite for many basic statistical methods. For example, summarising a variable by its mean and standard deviation makes sense because a normal distribution is completely defined by its mean and standard deviation. So how do we check this assumption?

Let's first load a dataset:

# install.packages("MASS")
library(MASS)
df <- Aids2

Now let's have a look at the structure of this dataset:

str(df)
## 'data.frame':    2843 obs. of  7 variables:
##  $ state  : Factor w/ 4 levels "NSW","Other",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ sex    : Factor w/ 2 levels "F","M": 2 2 2 2 2 2 2 2 2 2 ...
##  $ diag   : int  10905 11029 9551 9577 10015 9971 10746 10042 10464 10439 ...
##  $ death  : int  11081 11096 9983 9654 10290 10344 11135 11069 10956 10873 ...
##  $ status : Factor w/ 2 levels "A","D": 2 2 2 2 2 2 2 2 2 2 ...
##  $ T.categ: Factor w/ 8 levels "hs","hsid","id",..: 1 1 1 5 1 1 8 1 1 2 ...
##  $ age    : int  35 53 42 44 39 36 36 31 26 27 ...

Let's explore the age of these AIDS patients by plotting a histogram:

hist(df$age)

[Plot: histogram of df$age]

As we can see, this looks a lot like a normal distribution. Let's test this.

Shapiro-Wilk

There are many different tests for normality, and which one is appropriate depends on the situation; for example, many outliers can disturb a test. Among the most frequently used is the Shapiro-Wilk test.
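To see how sensitive the test is to outliers, here is a small sketch (the variable names, seed, and outlier values are my own, chosen purely for illustration):

```r
# A handful of extreme values is enough to make Shapiro-Wilk reject,
# even though the bulk of the data is perfectly normal.
set.seed(42)                            # arbitrary seed for reproducibility
clean <- rnorm(100, 0, 5)
contaminated <- c(clean, 40, 45, -42)   # append three extreme outliers
shapiro.test(clean)$p.value             # typically well above 0.05
shapiro.test(contaminated)$p.value      # typically far below 0.05
```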

Let’s create a normal vector:

v <- rnorm(100, 0, 5)  # 100 draws from a normal with mean 0 and sd 5
hist(v)

[Plot: histogram of v]

Shapiro-Wilk works great on this:

shapiro.test(v)
## 
##  Shapiro-Wilk normality test
## 
## data:  v
## W = 0.99211, p-value = 0.8289

Using this test on real data, however, is a little disappointing:

shapiro.test(df$age)
## 
##  Shapiro-Wilk normality test
## 
## data:  df$age
## W = 0.97294, p-value < 2.2e-16

So is that it? The most-cited test rejects, so we can never use the mean?
Not so fast. One reason for the rejection is sample size: with 2,843 observations, Shapiro-Wilk has the power to flag even tiny, practically irrelevant deviations from normality.
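A sketch of this sample-size effect, using an arbitrarily chosen mild contamination (my own example, not from the dataset; note that R's shapiro.test() refuses samples larger than 5,000 altogether):

```r
set.seed(1)
# The same mild mixture, 95 % N(0, 1) plus 5 % N(0, 2), at two sample sizes
x_small <- c(rnorm(95),   rnorm(5,   0, 2))
x_big   <- c(rnorm(4750), rnorm(250, 0, 2))   # 5000 is shapiro.test's maximum
shapiro.test(x_small)$p.value   # usually non-significant
shapiro.test(x_big)$p.value     # usually highly significant
```

The two samples come from the same distribution; only the sample size differs, yet the test's verdict flips.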

QQ-Plots

A Q-Q plot plots the quantiles of the data against the quantiles of a theoretical distribution. The rule of thumb: if the points stay close to a straight line and do not bend too strongly away from it, the data can be treated as normally distributed. Let's see:

qqnorm(v)

[Plot: normal Q-Q plot of v]

Now let's check our real-life data:

qqnorm(df$age)

[Plot: normal Q-Q plot of df$age]
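Base R can also draw the reference line for us: qqline() adds a line through the first and third quartiles of the data, which makes the "straight line" judgement much easier (the plot title and colour below are my own choices):

```r
# Q-Q plot with a quartile-based reference line for easier comparison
qqnorm(df$age, main = "Normal Q-Q plot of age")
qqline(df$age, col = "red")   # line through the 25 % and 75 % quantiles
```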

OK, that might look a little more convincing, but what is the benefit over a histogram?
The answer is linear regression.

Linear Regression and Model Fit

We can compute the percentiles of our data and save them in a vector:

q <- quantile(df$age, seq(0, 1, 0.01))  # the 0 %, 1 %, ..., 100 % percentiles
plot(q)

[Plot: percentiles of df$age]

This looks more or less similar to our Q-Q plot.

Now let's regress these quantiles on their index:

fit <- lm(unname(q) ~ seq(1, 101))  # regress the 101 percentiles on their index
plot(q)
abline(fit)

[Plot: percentiles of df$age with fitted regression line]

As we can see, a line fits the data quite well except at the edges. But how does this look in numbers?
Let's look at the summary of the regression:

summary(fit)
## 
## Call:
## lm(formula = unname(q) ~ seq(1, 101))
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -20.1127  -2.3021  -0.3599   1.8448  27.1381 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  19.7652     0.8457   23.37   <2e-16 ***
## seq(1, 101)   0.3475     0.0144   24.14   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.218 on 99 degrees of freedom
## Multiple R-squared:  0.8548, Adjusted R-squared:  0.8533 
## F-statistic: 582.6 on 1 and 99 DF,  p-value: < 2.2e-16

The adjusted R-squared, a good estimator of model fit, indicates that about 85 % of the variance in the quantiles is accounted for by this regression. 85 % is also roughly where I personally draw the line for treating a distribution as normal.
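This check is easy to wrap into a small reusable helper (the function name quantile_fit is my own, not part of any package; the commented values are the ones computed in this post):

```r
# Returns the adjusted R-squared of a straight line fitted through
# the 101 percentiles of a variable, as in the steps above.
quantile_fit <- function(x) {
  q   <- quantile(x, seq(0, 1, 0.01))
  fit <- lm(unname(q) ~ seq(1, 101))
  summary(fit)$adj.r.squared
}

quantile_fit(df$age)    # about 0.85, passes the 85 % cut-off
quantile_fit(df$death)  # about 0.79, fails it
```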

For comparison, let's look at a variable where the normality assumption does not hold:

hist(df$death)

[Plot: histogram of df$death]

For death, therefore, the mean is not really a good summary statistic.
The quantile plot is again very distinctive:

q <- quantile(df$death, seq(0, 1, 0.01))
plot(q)

[Plot: percentiles of df$death]

And we see this reflected in a poor regression line:

fit <- lm(unname(q) ~ seq(1, 101))
plot(q)
abline(fit)

[Plot: percentiles of df$death with fitted regression line]

The adjusted R-squared is still about 79 %: not a terrible fit, but not good enough.

summary(fit)
## 
## Call:
## lm(formula = unname(q) ~ seq(1, 101))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1514.2  -195.2   106.8   243.9   275.6 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 9963.267     60.846  163.75   <2e-16 ***
## seq(1, 101)   19.915      1.036   19.23   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 303.5 on 99 degrees of freedom
## Multiple R-squared:  0.7888, Adjusted R-squared:  0.7866 
## F-statistic: 369.7 on 1 and 99 DF,  p-value: < 2.2e-16