Are prices for 91 unleaded higher in Auckland compared to Hamilton?
The analysis carried out below finds statistically significant evidence that petrol prices for 91 unleaded are higher in Auckland than in Hamilton.
# Load the packages used throughout the analysis
library(ggplot2)   # histograms and box plots
library(cowplot)   # plot_grid() for side-by-side plots
library(tidyr)     # gather() for reshaping
library(dplyr)     # %>% pipe
library(car)       # qqPlot() with confidence envelopes
# Read in dataset
petrol <- read.csv('petrol.csv', header = TRUE)
# Set seed for reproducible sampling
set.seed(123)
# Shuffle and sample 50 prices from each city, dropping missing values
shuffled_hamilton <- sample(na.omit(petrol$Hamilton), 50)
shuffled_auckland <- sample(na.omit(petrol$Auckland), 50)
# Combine into a data frame and attach so columns can be used by name
data <- data.frame(Auckland = shuffled_auckland, Hamilton = shuffled_hamilton)
attach(data)
The median for Hamilton is 298.5 versus 311.0 for Auckland. This suggests that prices in Auckland are higher than in Hamilton.
summary(data)
## Auckland Hamilton
## Min. :290.0 Min. :283.0
## 1st Qu.:303.8 1st Qu.:289.2
## Median :311.0 Median :298.5
## Mean :309.6 Mean :296.3
## 3rd Qu.:315.0 3rd Qu.:301.0
## Max. :321.0 Max. :311.0
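The size of the observed gap can be quantified directly from the sampled columns created above (a quick sketch; the figures quoted come from summary()).
# Difference in sample medians and means (Auckland minus Hamilton)
median(Auckland) - median(Hamilton)
mean(Auckland) - mean(Hamilton)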
Both histograms show a broadly similar shape. The concentration of prices around a single peak suggests a degree of price competitiveness between petrol stations.
# Create histograms with aesthetic mappings to the x-axis and 10 bins
hamilton_hist <- ggplot(data, aes(x = Hamilton)) +
  geom_histogram(bins = 10, color = "black", fill = "#fdca40")
auckland_hist <- ggplot(data, aes(x = Auckland)) +
  geom_histogram(bins = 10, color = "black", fill = "#48cae4")
# Output the histograms in a side-by-side view
plot_grid(hamilton_hist, auckland_hist)
Stem-and-leaf plots suggest a reasonable degree of normality; however, the spread looks as though it may differ between the two samples, so the variances are checked formally with an F test later on.
stem(Hamilton)
##
## The decimal point is 1 digit(s) to the right of the |
##
## 28 | 33344
## 28 | 67777779
## 29 | 0233
## 29 | 66666888999999999
## 30 | 001112444444
## 30 | 999
## 31 | 1
stem(Auckland)
##
## The decimal point is 1 digit(s) to the right of the |
##
## 29 | 01
## 29 | 6689
## 30 | 1233333
## 30 | 667779999
## 31 | 00111222334444
## 31 | 5555679999
## 32 | 0011
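Before the formal variance test, a quick numeric check of the spread can be made by comparing the sample standard deviations (a minimal sketch).
# Compare sample standard deviations as a rough check of spread
sd(Auckland)
sd(Hamilton)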
The boxes are approximately the same size, which suggests a similar spread for the middle 50% of the data. The Auckland box sits higher on the plot, which suggests that prices in Auckland are generally higher than in Hamilton.
# Define custom colours
custom_colours <- c("#48cae4", "#fdca40")
# gather() reshapes the data from a wide format (Auckland, Hamilton) to a long format
# with two new columns, "type" and "value", so "type" can be used in the fill aesthetic
box_whisker <- data %>%
  gather(key = "type", value = "value", Hamilton, Auckland) %>%
  ggplot(aes(x = type, y = value, fill = type)) +
  geom_boxplot() +
  scale_fill_manual(values = custom_colours, guide = "none") +
  geom_jitter(color = "black", size = 1, alpha = 0.9) +
  theme(
    legend.position = "none",
    plot.title = element_text(size = 11)
  ) +
  xlab("")
print(box_whisker)
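The "size of the boxes" corresponds to the interquartile range, which can also be checked numerically (a small sketch).
# Interquartile range (middle 50% of prices) for each city
IQR(Auckland)
IQR(Hamilton)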
In the Q-Q plots we are looking for most points to lie along the diagonal, with roughly 95% of the points falling within the confidence envelope. Hamilton has a few points outside its envelope.
# Arrange plots in a 1 row by 2 column layout
par(mfrow = c(1, 2))
# Q-Q plot for the Hamilton sample
qqplot_hamilton <- qqPlot(Hamilton, envelope = list(col = "#48cae4"))
# Q-Q plot for the Auckland sample
qqplot_auckland <- qqPlot(Auckland, envelope = list(col = "#fdca40"))
The null hypothesis of the Shapiro-Wilk test is that the data follow a normal distribution. A p-value < 0.05 means we reject the null hypothesis, since there is evidence that the data are not normally distributed. If the p-value is > 0.05, we do not have enough evidence to reject normality.
Both p-values here are < 0.05. Therefore, there is evidence that neither sample is normally distributed.
# Output shapiro tests
shapiro.test(Auckland)
##
## Shapiro-Wilk normality test
##
## data: Auckland
## W = 0.95363, p-value = 0.04818
shapiro.test(Hamilton)
##
## Shapiro-Wilk normality test
##
## data: Hamilton
## W = 0.94148, p-value = 0.01543
The median and mean for Hamilton are 298.5 and 296.3; for Auckland they are 311.0 and 309.6. When the mean and median of a sample are close, the distribution is roughly symmetric, which is consistent with normality. In both cities they are quite close, so any departure from normality does not appear severe.
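The mean-median comparison can be produced in one step (a small sketch; the values quoted above come from summary()).
# Mean and median side by side for each city
sapply(data, function(x) c(mean = mean(x), median = median(x)))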
Let \(\mu_1\) be the population mean price for Auckland and \(\mu_2\) the population mean price for Hamilton.
# Define a table of the test setup and output it
tb1 = data.frame(c("Null hypothesis - Auckland = Hamilton", "Alternative hypothesis - Auckland > Hamilton", "Level of significance", "Sample sizes"),
                 c("$H_0:\\mu_1=\\mu_2$", "$H_A:\\mu_1>\\mu_2$", "$\\alpha=0.05$", "$n_1=50, n_2=50$"))
knitr::kable(tb1, escape = FALSE, col.names = NULL)
Null hypothesis - Auckland = Hamilton | \(H_0:\mu_1=\mu_2\) |
Alternative hypothesis - Auckland > Hamilton | \(H_A:\mu_1>\mu_2\) |
Level of significance | \(\alpha=0.05\) |
Sample sizes | \(n_1=50, n_2=50\) |
A two-sample t-test is required because we are comparing two samples, which we assume are drawn from independent populations. A pooled-variance t-test also assumes the samples come from normal distributions with equal variances. Although the Shapiro-Wilk tests suggest the data are not normally distributed, the t-test is still appropriate when the sample sizes are large enough (typically > 30), as they are here with n = 50 in each group.
We are interested in whether Auckland > Hamilton, so a one-sided test is needed.
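If we preferred not to assume equal variances, a Welch two-sample t-test could be run instead as a robustness check (a sketch; the analysis below uses the pooled test after checking the variances).
# Welch version of the one-sided test (does not assume equal variances)
t.test(Auckland, Hamilton, alternative = "greater", var.equal = FALSE)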
Under the null hypothesis of the F test the variances are equal; the alternative is that they are not. The p-value is > 0.05 and the F ratio is close to 1. Therefore, we do not reject the null hypothesis and proceed as though the variances are equal.
# Run the test
var.test(Auckland, Hamilton)
##
## F test to compare two variances
##
## data: Auckland and Hamilton
## F = 1.0218, num df = 49, denom df = 49, p-value = 0.9401
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
## 0.5798495 1.8006116
## sample estimates:
## ratio of variances
## 1.021804
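The F statistic reported above is simply the ratio of the two sample variances, which can be verified directly (a quick sketch).
# Ratio of sample variances; should match the F statistic from var.test()
var(Auckland) / var(Hamilton)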
The p-value is far below 0.05. Therefore, we reject the null hypothesis. There is statistically significant evidence that 91 unleaded prices in Auckland are higher than in Hamilton.
# Run the test, var.equal=TRUE for pooled variance
t.test(Auckland, Hamilton, alternative="greater", var.equal=TRUE)
##
## Two Sample t-test
##
## data: Auckland and Hamilton
## t = 8.6754, df = 98, p-value = 4.501e-14
## alternative hypothesis: true difference in means is greater than 0
## 95 percent confidence interval:
## 10.75426 Inf
## sample estimates:
## mean of x mean of y
## 309.58 296.28
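For reference, the pooled-variance t statistic and one-sided p-value can be reproduced by hand from the pooled variance formula (a minimal sketch verifying the output above).
# Pooled variance estimate and the resulting t statistic
n1 <- length(Auckland); n2 <- length(Hamilton)
sp2 <- ((n1 - 1) * var(Auckland) + (n2 - 1) * var(Hamilton)) / (n1 + n2 - 2)
t_stat <- (mean(Auckland) - mean(Hamilton)) / sqrt(sp2 * (1 / n1 + 1 / n2))
# One-sided p-value for the alternative "Auckland > Hamilton"
p_val <- pt(t_stat, df = n1 + n2 - 2, lower.tail = FALSE)
c(t = t_stat, p = p_val)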