When a dataset does not satisfy the assumptions of a specific probability distribution, non-parametric methods are the natural way to analyze it. Non-parametric methods do not assume that the data follow a particular distribution, so one can draw inferences and test hypotheses without distributional assumptions such as normality (other assumptions, such as independent observations, still apply). Now let's look at a set of non-parametric tests that can be used when a dataset does not conform to the assumptions of any specific probability distribution.
If the assumption of normality is violated, non-parametric methods are needed to answer a question such as: is there any difference in city mileage between cars for which a manual transmission version is available and cars for which it is not? Because the test is based on ranks, it compares the locations of the two distributions rather than their means:
> wilcox.test(Cars93$MPG.city~Cars93$Man.trans.avail, correct = F)

Wilcoxon rank sum test

data:  Cars93$MPG.city by Cars93$Man.trans.avail
W = 380, p-value = 1e-06
alternative hypothesis: true location shift is not equal to 0
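The normality assumption mentioned above can be checked before choosing the test. The following is a minimal sketch, not part of the original analysis: it assumes that Cars93 is loaded from the MASS package (the text does not say where the data comes from) and it uses the Shapiro-Wilk test, which is one common choice among several.

```r
library(MASS)  # assumption: Cars93 is taken from the MASS package

# Shapiro-Wilk normality test for city mileage within each
# level of Man.trans.avail; keep only the p-values
normality <- tapply(Cars93$MPG.city, Cars93$Man.trans.avail,
                    function(x) shapiro.test(x)$p.value)
print(normality)
```

A small p-value in either group suggests that normality is doubtful for that group, which is the situation in which the rank-based test above is the safer choice.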
The paired argument can be used when the two samples are matched pairs and do not satisfy the normality assumption:
> wilcox.test(Cars93$MPG.city, Cars93$MPG.highway, paired = T)

Wilcoxon signed rank test with continuity correction

data:  Cars93$MPG.city and Cars93$MPG.highway
V = 0, p-value <2e-16
alternative hypothesis: true location shift is not equal to 0
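To see where the statistic V in the signed rank test above comes from, it can be recomputed by hand. This is a sketch under the assumption that Cars93 comes from the MASS package; the variable names d and V are chosen here for illustration.

```r
library(MASS)  # assumption: Cars93 is taken from the MASS package

# Paired differences; zero differences are dropped before ranking
d <- Cars93$MPG.city - Cars93$MPG.highway
d <- d[d != 0]

# V is the sum of the ranks of |d| that belong to positive differences
V <- sum(rank(abs(d))[d > 0])
print(V)
```

The reported V = 0 therefore means that no car has a higher city mileage than highway mileage, so there are no positive differences to contribute ranks.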
If the two samples are independent rather than matched and do not follow a normal distribution, the Mann-Whitney-Wilcoxon test can be used to test whether the two samples differ significantly in location:
> wilcox.test(Cars93$MPG.city~Cars93$Man.trans.avail, data=Cars93)

Wilcoxon rank sum test with continuity correction

data:  Cars93$MPG.city by Cars93$Man.trans.avail
W = 380, p-value = 1e-06
alternative hypothesis: true location shift is not equal to 0
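The statistic W reported by the rank sum test above can also be reproduced by hand, which makes clear that only the ranks of the observations enter the test. This is a sketch assuming that Cars93 comes from the MASS package; the names g, x, y, r, and W are chosen here for illustration.

```r
library(MASS)  # assumption: Cars93 is taken from the MASS package

# Split city mileage by the two levels of Man.trans.avail
g <- Cars93$Man.trans.avail
x <- Cars93$MPG.city[g == levels(g)[1]]  # first factor level
y <- Cars93$MPG.city[g == levels(g)[2]]  # second factor level

# Rank all observations together (ties receive average ranks)
r <- rank(c(x, y))

# W is the rank sum of the first group minus its smallest possible value
n_x <- length(x)
W <- sum(r[seq_along(x)]) - n_x * (n_x + 1) / 2
print(W)
```

The correct argument does not affect W; it only controls whether a continuity correction is applied in the normal approximation used for the p-value.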
To compare more than two groups, that is, the non-parametric counterpart of one-way ANOVA, we can use the Kruskal-Wallis test. It is a distribution-free test based on ranks, so it compares the locations of the groups rather than their means:
> kruskal.test(Cars93$MPG.city~Cars93$Cylinders, data= Cars93)

Kruskal-Wallis rank sum test

data:  Cars93$MPG.city by Cars93$Cylinders
Kruskal-Wallis chi-squared = 68, df = 5, p-value = 3e-13
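The Kruskal-Wallis statistic above can likewise be rebuilt from the ranks. The sketch below assumes that Cars93 comes from the MASS package and uses the usual tie-corrected form of the statistic; the variable names are chosen here for illustration.

```r
library(MASS)  # assumption: Cars93 is taken from the MASS package

x <- Cars93$MPG.city
g <- droplevels(Cars93$Cylinders)
n <- length(x)

# Rank all observations together, then summarise the ranks per group
r <- rank(x)
rank_sums <- tapply(r, g, sum)
group_n <- tapply(r, g, length)

# Uncorrected statistic, then divide by the correction factor for ties
H <- 12 / (n * (n + 1)) * sum(rank_sums^2 / group_n) - 3 * (n + 1)
ties <- table(x)
H <- H / (1 - sum(ties^3 - ties) / (n^3 - n))

# H is compared with a chi-squared distribution on (groups - 1) df
df <- nlevels(g) - 1
p_value <- pchisq(H, df, lower.tail = FALSE)
print(c(H = H, df = df, p.value = p_value))
```

A significant result only says that at least one group differs in location from the others; it does not identify which one.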