Routine Normality Testing in R
Assumption of Normality
The assumption of normality is a prerequisite for many basic statistical procedures. For example, summarizing a variable by its mean and standard deviation rests on the fact that a normal distribution is completely defined by its mean and standard deviation. So how do we check this assumption?
Let's first load a dataset:
# install.packages("MASS")
library(MASS)
df<-Aids2
Now let's have a look at the structure of this dataset:
str(df)
## 'data.frame': 2843 obs. of 7 variables:
##  $ state  : Factor w/ 4 levels "NSW","Other",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ sex    : Factor w/ 2 levels "F","M": 2 2 2 2 2 2 2 2 2 2 ...
##  $ diag   : int 10905 11029 9551 9577 10015 9971 10746 10042 10464 10439 ...
##  $ death  : int 11081 11096 9983 9654 10290 10344 11135 11069 10956 10873 ...
##  $ status : Factor w/ 2 levels "A","D": 2 2 2 2 2 2 2 2 2 2 ...
##  $ T.categ: Factor w/ 8 levels "hs","hsid","id",..: 1 1 1 5 1 1 8 1 1 2 ...
##  $ age    : int 35 53 42 44 39 36 36 31 26 27 ...
Let’s explore the age of these AIDS patients by plotting a histogram:
hist(df$age)
As we can see, this looks a lot like a normal distribution. Let’s test this.
Shapiro-Wilk
There are many different tests for normality, and which one is appropriate depends on the situation; many outliers, for example, can disturb a test. Among the most frequently used is the Shapiro-Wilk test.
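To illustrate that outlier sensitivity before we proceed, here is a quick sketch on simulated data (the value 50 is an arbitrarily extreme outlier, not anything from the AIDS dataset):

```r
set.seed(42)                     # for reproducibility
x <- rnorm(100)                  # 100 draws from a standard normal
shapiro.test(x)$p.value          # p-value for the clean sample
shapiro.test(c(x, 50))$p.value   # a single extreme outlier makes the test reject
```

A single aberrant value is enough to push the p-value far below any conventional significance level, even though the other 100 observations are perfectly normal.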
Let’s create a normal vector:
v<-rnorm(100,0,5)
hist(v)
Shapiro-Wilk works great on this:
shapiro.test(v)
## 
##  Shapiro-Wilk normality test
## 
## data:  v
## W = 0.99211, p-value = 0.8289
Using this test on real data, however, is a little disappointing:
shapiro.test(df$age)
## 
##  Shapiro-Wilk normality test
## 
## data:  df$age
## W = 0.97294, p-value < 2.2e-16
So is this it? The most frequently cited test fails on our data, so we can never use the mean?
Well not so fast.
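Part of the answer is sample size: with n = 2843, Shapiro-Wilk has enough power to flag even tiny, practically irrelevant deviations from normality. A small sketch of this effect (the subsample size of 50 is an arbitrary choice):

```r
library(MASS)
set.seed(1)
# On the full sample of 2843 ages the test rejects decisively...
shapiro.test(Aids2$age)$p.value
# ...while a small random subsample of the very same ages is judged far less harshly
shapiro.test(sample(Aids2$age, 50))$p.value
```

The data did not change between the two calls; only the power of the test did. This is why a significant Shapiro-Wilk result on a large sample should not, on its own, stop us from using the mean.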
QQ-Plots
A Q-Q plot plots the quantiles of the data against the quantiles of a theoretical normal distribution. If the points stay close to a straight line and do not bend too strongly away from it, the data can be considered approximately normally distributed. Let’s see:
qqnorm(v)
Now let’s check our real-life data
qqnorm(df$age)
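A reference line makes the Q-Q plot easier to judge by eye; base R provides qqline() for exactly this, drawing a line through the first and third quartiles:

```r
library(MASS)
qqnorm(Aids2$age)   # sample quantiles against theoretical normal quantiles
qqline(Aids2$age)   # reference line through the 1st and 3rd quartiles
```

Points hugging the line in the middle with some drift in the tails is the same pattern the histogram hinted at.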
OK, that might look a little more convincing, but what is the benefit over a histogram?
The answer is linear regression.
Linear Regression and Model fit
We can recreate the quantiles for our data and save them in a vector:
q<-quantile(df$age,seq(0,1,0.01))
plot(q)
This is more or less similar to our quantile plot.
Now let’s run a linear regression of these quantiles on their index:
fit<-lm(unname(q)~seq(1,101))
plot(q)
abline(fit)
As we can see, a line fits the data quite well, except at the edges. But how does this look in numbers?
Let’s look at the summary of the regression:
summary(fit)
## 
## Call:
## lm(formula = unname(q) ~ seq(1, 101))
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -20.1127  -2.3021  -0.3599   1.8448  27.1381 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  19.7652     0.8457   23.37   <2e-16 ***
## seq(1, 101)   0.3475     0.0144   24.14   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.218 on 99 degrees of freedom
## Multiple R-squared:  0.8548, Adjusted R-squared:  0.8533 
## F-statistic: 582.6 on 1 and 99 DF,  p-value: < 2.2e-16
We can see the adjusted R-squared, which is a good estimator of model fit, indicating that about 85 % of the variance is accounted for by this regression. 85 % is also roughly where I draw the line for calling a distribution normal.
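The steps above can be wrapped into a small helper. normality_r2() is a hypothetical convenience function (the name and the 101-percentile grid are my choices, not a standard API):

```r
library(MASS)

# Regress the 101 empirical percentiles of x on their index
# and return the adjusted R-squared as a rough normality score.
normality_r2 <- function(x) {
  q <- quantile(x, seq(0, 1, 0.01))
  fit <- lm(unname(q) ~ seq_along(q))
  summary(fit)$adj.r.squared
}

normality_r2(Aids2$age)   # about 0.85, matching the summary above
```

With a function like this, the "85 % rule of thumb" can be applied to any numeric variable in one line.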
For comparison, let’s look at a variable where the normality assumption does not hold:
hist(df$death)
For death, the mean is therefore not really a good summary statistic.
The quantile plot is again very distinctive:
q<-quantile(df$death,seq(0,1,0.01))
plot(q)
And we see this reflected in a poor regression line:
fit<-lm(unname(q)~seq(1,101))
plot(q)
abline(fit)
Our adjusted R-squared is still 79 %: not a bad fit, but not good enough.
summary(fit)
## 
## Call:
## lm(formula = unname(q) ~ seq(1, 101))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1514.2  -195.2   106.8   243.9   275.6 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 9963.267     60.846  163.75   <2e-16 ***
## seq(1, 101)   19.915      1.036   19.23   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 303.5 on 99 degrees of freedom
## Multiple R-squared:  0.7888, Adjusted R-squared:  0.7866 
## F-statistic: 369.7 on 1 and 99 DF,  p-value: < 2.2e-16