Subject Matter Question

Are prices for 91 unleaded higher in Auckland compared to Hamilton?

Result

The analysis below finds statistically significant evidence that petrol prices for 91 unleaded are higher in Auckland than in Hamilton.

Methodology

Data Collection

Load and prepare data

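# Load the packages used below (assumed from the functions called in this report)
library(ggplot2)   # ggplot(), geom_histogram(), geom_boxplot()
library(cowplot)   # plot_grid()
library(tidyr)     # gather()
library(dplyr)     # %>%
library(car)       # qqPlot()

# Read in the dataset
petrol <- read.csv('petrol.csv', header=TRUE)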

# Set seed
set.seed(123)

# Randomly sample 50 non-missing prices for each city
shuffled_hamilton <- sample(na.omit(petrol$Hamilton), 50)
shuffled_auckland <- sample(na.omit(petrol$Auckland), 50)

# Combine into a data frame
data <- data.frame(Auckland = shuffled_auckland, Hamilton = shuffled_hamilton)
attach(data)

Explore the data

What is the central tendency of the data?

The median price is 298.5 for Hamilton versus 311.0 for Auckland. This suggests that prices in Auckland are higher than in Hamilton.

summary(data)
##     Auckland        Hamilton    
##  Min.   :290.0   Min.   :283.0  
##  1st Qu.:303.8   1st Qu.:289.2  
##  Median :311.0   Median :298.5  
##  Mean   :309.6   Mean   :296.3  
##  3rd Qu.:315.0   3rd Qu.:301.0  
##  Max.   :321.0   Max.   :311.0

What is the distribution of the data?

Histograms

Histograms show similar shapes. The peak suggests a degree of competitiveness between petrol stations.

# Create histograms with aesthetic mappings to the x-axis and 10 binned values
hamilton_hist <- ggplot(data, aes(x = Hamilton)) +
  geom_histogram(bins=10, color="black", fill="#fdca40")

auckland_hist <- ggplot(data, aes(x = Auckland)) +
  geom_histogram(bins=10, color="black", fill="#48cae4")

# Output the histograms in a side by side view
plot_grid(hamilton_hist, auckland_hist)

Stem and Leaf plots

The stem and leaf plots suggest a degree of normality; however, the spread looks slightly different between the two cities.

Hamilton stem and leaf plot
stem(Hamilton)
## 
##   The decimal point is 1 digit(s) to the right of the |
## 
##   28 | 33344
##   28 | 67777779
##   29 | 0233
##   29 | 66666888999999999
##   30 | 001112444444
##   30 | 999
##   31 | 1
Auckland stem and leaf plot
stem(Auckland)
## 
##   The decimal point is 1 digit(s) to the right of the |
## 
##   29 | 01
##   29 | 6689
##   30 | 1233333
##   30 | 667779999
##   31 | 00111222334444
##   31 | 5555679999
##   32 | 0011

Box and Whisker plots

The boxes are approximately the same size, which suggests a similar spread for the middle 50% of the data. The Auckland box sits higher on the graph, which suggests that prices in Auckland are higher than in Hamilton.

# Define custom colours
custom_colours <- c("#48cae4", "#fdca40")

# gather() reshapes the data from a wide format (Auckland Hamilton) to a long format with 2 new columns "type" and "value". This will enable the "type" column to be used in the fill aesthetic
box_whisker <- data %>%
  gather(key = "type", value = "value", Hamilton, Auckland) %>%
  ggplot( aes(x=type, y=value, fill=type)) +
    geom_boxplot() +
    scale_fill_manual(values = custom_colours, guide = "none") +
    geom_jitter(color="black", size=1, alpha=0.9) +
    theme(
      legend.position="none",
      plot.title = element_text(size=11)
    ) +
    xlab("")
print(box_whisker)
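
The wide-to-long reshape described in the comment above can also be sketched in base R with stack(), which needs no extra packages. The prices below are made up for illustration and are not taken from petrol.csv.

```r
# Hypothetical wide data: one column per city (values are illustrative only)
wide <- data.frame(Auckland = c(310, 314, 303), Hamilton = c(298, 289, 301))

# stack() returns a long data frame with columns "values" and "ind"
long <- stack(wide)

# Rename to match the "value"/"type" columns used for the box plot
names(long) <- c("value", "type")
print(long)
```

Each row of the long data frame holds one price and the city it belongs to, which is the shape the fill aesthetic needs.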

Is the data normally distributed?

QQ Plots

We are looking for most points lying along the diagonal and about 95% of the points situated within the confidence interval bands. Hamilton has some points outside of the band.

# Create 1 row and 2 column structure
par(mfrow = c(1, 2))

# Create qq plot for Hamilton data
qqplot_hamilton <- qqPlot(Hamilton, envelope=list(col="#48cae4"))

# Create qq plot for Auckland data
qqplot_auckland <- qqPlot(Auckland, envelope=list(col="#fdca40"))
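
As a rough sketch of what a QQ plot compares, the sample values are paired with the quantiles expected under a normal distribution; points close to a straight line indicate approximate normality. This uses base R's qqnorm() rather than car::qqPlot(), and the sample is artificial rather than the petrol data.

```r
# Artificial price-like sample built from normal quantiles (not the petrol data)
x <- round(300 + 8 * qnorm(ppoints(50)))

# qqnorm() with plot.it = FALSE returns theoretical (x) and sample (y) quantiles
qq <- qqnorm(x, plot.it = FALSE)

# A correlation close to 1 means the points lie close to a straight line
qq_cor <- cor(qq$x, qq$y)
print(qq_cor)
```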

Shapiro-Wilk test for normality

The null hypothesis is that the data follow a normal distribution. A p-value < 0.05 means that we reject the null hypothesis, since there is evidence that the data are not normally distributed. A p-value > 0.05 means that we fail to reject the null hypothesis; this is not proof of normality, only an absence of evidence against it.

Both p-values are < 0.05. Therefore, there is evidence that the data is not normally distributed.

# Output shapiro tests
shapiro.test(Auckland)
## 
##  Shapiro-Wilk normality test
## 
## data:  Auckland
## W = 0.95363, p-value = 0.04818
shapiro.test(Hamilton)
## 
##  Shapiro-Wilk normality test
## 
## data:  Hamilton
## W = 0.94148, p-value = 0.01543
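
The decision rule can be sketched as a small helper applied to the two p-values reported above. The function name normality_decision is hypothetical and is introduced here only for illustration.

```r
# Decision rule at significance level alpha (default 0.05)
# normality_decision is a hypothetical helper, not part of any package
normality_decision <- function(p_value, alpha = 0.05) {
  if (p_value < alpha) "reject normality" else "fail to reject normality"
}

# p-values reported by shapiro.test() above
auckland_decision <- normality_decision(0.04818)
hamilton_decision <- normality_decision(0.01543)
print(c(Auckland = auckland_decision, Hamilton = hamilton_decision))
```

Note that the Auckland p-value is only just below 0.05, so the evidence against normality is weaker for Auckland than for Hamilton.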

Comparison of median and mean

The median and mean are 298.5 and 296.3 for Hamilton, and 311.0 and 309.6 for Auckland. The closer the two values, the more symmetric the distribution. They are quite close, which suggests that the departure from normality is modest.
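
As a quick check of this comparison, the gap between mean and median can be computed from the figures in summary(data) above. Reading a negative gap as a hint of mild left skew is a common rule of thumb, not a formal test.

```r
# Means and medians reported by summary(data) above
centre <- data.frame(
  city   = c("Auckland", "Hamilton"),
  mean   = c(309.6, 296.3),
  median = c(311.0, 298.5)
)

# Negative gap: the mean sits below the median
centre$gap <- centre$mean - centre$median
print(centre)
```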

What are the null and alternative hypotheses?

Let \(\mu_1\) be the population mean price for Auckland and \(\mu_2\) be the population mean price for Hamilton.

# Define a table of figures and output
tb1 = data.frame(c("Null hypothesis - Auckland <= Hamilton","Alternative hypothesis - Auckland > Hamilton","Level of significance", "Sample sizes"),
  c("$H_0:\\mu_1\\le\\mu_2$","$H_A:\\mu_1>\\mu_2$","$\\alpha=0.05$","$n_1=50, n_2=50$"))
knitr::kable(tb1, escape = FALSE, col.names = NULL)
Null hypothesis - Auckland <= Hamilton \(H_0:\mu_1\le\mu_2\)
Alternative hypothesis - Auckland > Hamilton \(H_A:\mu_1>\mu_2\)
Level of significance \(\alpha=0.05\)
Sample sizes \(n_1=50, n_2=50\)

Explanation

A two-sample t-test is required because we are comparing the means of two samples, which we assume are drawn from independent populations. If both samples were normally distributed with equal variances, a pooled-variance t-test would apply directly. In our case the data are not normally distributed, but the test is still appropriate when the sample sizes are large enough (typically > 30), because the sampling distribution of the mean is then approximately normal.

We are interested in whether Auckland > Hamilton, so we need a one-sided test.
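
For reference, the one-sided hypotheses and the standard pooled-variance test statistic can be written as follows, with \(\mu_1\) and \(\mu_2\) the population means for Auckland and Hamilton. The symbols \(\bar{x}_i\) (sample means), \(s_i^2\) (sample variances) and \(s_p\) (pooled standard deviation) are notation introduced here for illustration.

```latex
\[
H_0:\ \mu_1 \le \mu_2 \qquad H_A:\ \mu_1 > \mu_2
\]
\[
t = \frac{\bar{x}_1 - \bar{x}_2}{s_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}},
\qquad
s_p^2 = \frac{(n_1 - 1)\,s_1^2 + (n_2 - 1)\,s_2^2}{n_1 + n_2 - 2}
\]
```

With \(n_1 = n_2 = 50\), the statistic is compared against a t distribution with \(n_1 + n_2 - 2 = 98\) degrees of freedom.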

Checking the variance

The null hypothesis is that the two variances are equal; the alternative is that they are not. The p-value is > 0.05 and the F statistic (the ratio of the two sample variances) is close to 1. Therefore, we do not reject the null hypothesis, and we treat the variances as equal.

# Run the test 
var.test(Auckland, Hamilton)
## 
##  F test to compare two variances
## 
## data:  Auckland and Hamilton
## F = 1.0218, num df = 49, denom df = 49, p-value = 0.9401
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
##  0.5798495 1.8006116
## sample estimates:
## ratio of variances 
##           1.021804
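
To make the F statistic concrete, the sketch below recomputes it by hand and compares it with var.test(). The two samples are simulated stand-ins, not the petrol data, so the numbers will differ from the output above.

```r
# Simulated samples standing in for the two cities (not the petrol data)
set.seed(1)
a <- rnorm(50, mean = 310, sd = 8)
h <- rnorm(50, mean = 296, sd = 8)

# The F statistic is the ratio of the two sample variances
f_manual <- var(a) / var(h)

# Two-sided p-value from the F distribution with (n1 - 1, n2 - 1) df
p_lower <- pf(f_manual, df1 = length(a) - 1, df2 = length(h) - 1)
p_manual <- 2 * min(p_lower, 1 - p_lower)

# Compare against the built-in test
res <- var.test(a, h)
print(c(manual = f_manual, var_test = unname(res$statistic)))
print(c(manual = p_manual, var_test = res$p.value))
```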

Two-sample t-test with pooled variance

The p-value is < 0.05. Therefore, we reject the null hypothesis. There is statistically significant evidence that prices in Auckland are higher than in Hamilton.

# Run the test, var.equal=TRUE for pooled variance
t.test(Auckland, Hamilton, alternative="greater", var.equal=TRUE)
## 
##  Two Sample t-test
## 
## data:  Auckland and Hamilton
## t = 8.6754, df = 98, p-value = 4.501e-14
## alternative hypothesis: true difference in means is greater than 0
## 95 percent confidence interval:
##  10.75426      Inf
## sample estimates:
## mean of x mean of y 
##    309.58    296.28
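
As a consistency check, the lower confidence bound and the p-value can be reconstructed from the figures reported above. This is a sketch that uses only the printed (rounded) output, so small rounding differences are expected.

```r
# Figures reported by t.test() above
mean_auckland <- 309.58
mean_hamilton <- 296.28
t_stat <- 8.6754
deg_free <- 98

# Standard error implied by the reported t statistic
diff_means <- mean_auckland - mean_hamilton
se <- diff_means / t_stat

# One-sided 95% lower bound: difference minus critical value times SE
lower <- diff_means - qt(0.95, deg_free) * se

# One-sided p-value: upper tail of the t distribution
p_value <- pt(t_stat, deg_free, lower.tail = FALSE)

print(round(diff_means, 2))
print(round(lower, 2))
print(p_value)
```

The lower bound sits well above 0, which agrees with rejecting the null hypothesis at the 5% level.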