R for Marketing Students (Part 1)

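This is an open-source book that can be read online: R for marketing students. Judging from the first two chapters, it is aimed at readers with no background in R. Since I already know some R, I will skip around while reading and only take notes on selected parts.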

Data frame operations

Importing and saving data

If the author's link is down, you can use my copy: airbnb.rds

library(tidyverse)
library(io)  # provides qread() and qwrite()

# First run only: download the CSV and cache it locally as an .rds file
# airbnb <- read_csv('http://users.telenet.be/samuelfranssens/tutorial_data/tomslee_airbnb_belgium_1454_2017-07-14.csv')
# qwrite(airbnb, "airbnb.rds")

airbnb <- qread('airbnb.rds')

# Inspect the data
head(airbnb)
print(airbnb, n = 25, width = 100)
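The caching pattern above (download once, save as .rds, reload later) can also be sketched with base R's saveRDS() and readRDS(), in case the io package is not installed. This is only a minimal sketch on a made-up data frame; the column values and the temporary file path are assumptions for illustration, not the real Airbnb data.

```r
# Toy stand-in for the downloaded data (hypothetical values)
toy <- data.frame(
  room_id = c(101L, 102L, 103L),
  price   = c(55, 120, 80)
)

# Save once to an .rds file ...
path <- tempfile(fileext = ".rds")
saveRDS(toy, path)

# ... and reload it in later sessions without touching the network
toy_again <- readRDS(path)
identical(toy, toy_again)
```

An .rds file keeps column types (factors, integers) intact, which a CSV round trip would not.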

Variable transformations

# Variable transformations
# mutate(): modify existing variables and create new ones
airbnb <- airbnb %>%
  mutate(
    room_id = factor(room_id),
    host_id = factor(host_id),
    overall_satisfaction_100 = overall_satisfaction * 20
  ) %>%
  select(-country, -survey_id) %>%  # drop these two variables
  rename(country = city, city = borough)  # rename variables (new_name = old_name)
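Note that the rename() call reuses the name city: the old city column becomes country, and borough becomes the new city. A small sketch on a made-up tibble shows the effect; the values below, including the idea that the raw city column holds country-level values, are assumptions for illustration.

```r
library(dplyr)

# Hypothetical miniature of the raw data
toy <- tibble(
  room_id = c(1, 2),
  city    = c("Belgium", "Belgium"),   # assumed: country-level values
  borough = c("Gent", "Brugge"),
  overall_satisfaction = c(4.5, 5)
)

toy2 <- toy %>%
  mutate(
    room_id = factor(room_id),
    overall_satisfaction_100 = overall_satisfaction * 20  # same rescaling as in the post
  ) %>%
  rename(country = city, city = borough)  # new_name = old_name, applied simultaneously

names(toy2)
```

Because both renames are applied at once, the two uses of city do not collide.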

Using the %in% operator

library(Hmisc)
topten <- c("Brussel","Antwerpen","Gent","Charleroi","Liege","Brugge","Namur","Leuven","Mons","Aalst")

# dplyr::filter() and stats::filter() share a name, so the dplyr:: prefix is spelled out here to avoid the conflict.
airbnb.topten <- dplyr::filter(airbnb, city %in% topten)
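To see what %in% does inside the filter, here is a minimal base R sketch. The shortened city list and the test vector are made up for illustration.

```r
topten <- c("Brussel", "Antwerpen", "Gent")   # shortened list for illustration
cities <- c("Gent", "Oostende", NA, "Brussel")

# %in% returns one TRUE/FALSE per element on the left-hand side
keep <- cities %in% topten
keep

# Unlike ==, %in% never returns NA, so missing values are simply dropped
cities[keep]
```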

Grouping and summarising

airbnb %>%
  group_by(city) %>%
  summarise(nr_per_city = n()) %>%  # number of rows per group (length() of a column would also work)
  arrange(desc(nr_per_city))  # sort in descending order
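The group_by() + summarise(n()) + arrange() chain is common enough that dplyr offers count() as a shortcut. The sketch below compares both on a made-up tibble; the city values are illustrative only.

```r
library(dplyr)

toy <- tibble(city = c("Gent", "Brussel", "Gent", "Gent", "Brussel", "Leuven"))

# Long form, as in the post
by_hand <- toy %>%
  group_by(city) %>%
  summarise(nr_per_city = n()) %>%
  arrange(desc(nr_per_city))

# Shortcut: count rows per city, sorted from most to least frequent
shortcut <- toy %>%
  count(city, sort = TRUE, name = "nr_per_city")

by_hand
shortcut
```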

Descriptive statistics

airbnb %>%
  group_by(city) %>%
  summarise(nr_per_city = n(),
            avg_price = mean(price, na.rm = TRUE)) %>%
  arrange(desc(avg_price)) %>%
  print(n = Inf)  # n = Inf prints every row

# Inf is infinity
is.infinite(Inf)  # TRUE

Plotting

Log transformation

airbnb.topten %>%
  ggplot(aes(
    x = city, y = log(price, base = exp(1))
  )) +
  geom_jitter()
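The base = exp(1) argument requests the natural logarithm, which is also what log() computes by default. A quick check on made-up prices:

```r
price <- c(25, 60, 110, 950)   # hypothetical prices

natural_explicit <- log(price, base = exp(1))
natural_default  <- log(price)

# Both calls give the natural logarithm
all.equal(natural_explicit, natural_default)

# Back-transforming with exp() recovers the original prices
all.equal(exp(natural_default), price)
```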

Plotting the median

airbnb.topten %>%
  ggplot(aes(
    x = city, y = price
  )) +
  geom_jitter() +
  # `fun` replaces `fun.y`, which is deprecated since ggplot2 3.3.0
  stat_summary(fun = median, colour = 'tomato3',
               size = 4, geom = "point")

Plotting the mean

airbnb.topten %>%
  ggplot(aes(
    x = city, y = price
  )) +
  geom_jitter() +
  # `fun` replaces `fun.y`, which is deprecated since ggplot2 3.3.0
  stat_summary(fun = median, colour = 'tomato3',
               size = 4, geom = 'point') +
  stat_summary(fun = mean, colour = 'green',
               size = 4, geom = 'point',
               shape = 23, fill = 'green')

Basic data analysis

airbnb <- qread('airbnb.rds') %>%
  mutate(
    room_id = factor(room_id),
    host_id = factor(host_id)
  ) %>%
  select(-country, -survey_id) %>%
  rename(country = city, city = borough) %>%
  select(-bathrooms, -minstay,
         -location, -last_modified) %>%
  # recode a satisfaction score of 0 as missing
  mutate(overall_satisfaction = replace(
    overall_satisfaction,
    overall_satisfaction == 0, NA))
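The last mutate() turns zeros into NA so that they no longer count as real scores. A minimal sketch of the idea, assuming (as the code above implies) that a score of 0 means "no rating"; the ratings vector is made up, and na_if() is shown as a dplyr alternative rather than what the post uses.

```r
library(dplyr)

ratings <- c(4.5, 0, 5, 0, 3)   # hypothetical scores; 0 is assumed to mean "no rating"

with_replace <- replace(ratings, ratings == 0, NA)  # approach used in the post
with_na_if   <- na_if(ratings, 0)                   # dplyr shortcut for the same idea

mean(ratings)                      # zeros drag the average down
mean(with_replace, na.rm = TRUE)   # average over rated listings only
```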

Merging data

Download the data:

population <- xlsx::read.xlsx2("population.xlsx", sheetIndex = 1) %>% as_tibble()

# Or:
population <- readxl::read_excel('population.xlsx')

# Inspect the data
population %>% head()

# Join airbnb and population side by side, treating city in airbnb and place in population as the same variable.
# First fix the place names that are spelled differently in the two data sets:
population <- population %>%
  mutate(
    place = replace(place, place == 'Brussels', 'Brussel'),
    place = replace(place, place == 'Ostend', 'Oostende'),
    place = replace(place, place == 'Mouscron', 'Moeskroen')
  )

airbnb.merged <- left_join(airbnb,
                           population,
                           by = c('city' = 'place'))
airbnb.merged %>%
  select(room_id, city, price, population)
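One way to find which names need fixing before the join is an anti_join(), which lists the rows on the left that have no partner on the right. The sketch below uses two made-up tibbles with invented population numbers; the lookup vector fixes is a hypothetical helper, not something from the book.

```r
library(dplyr)

listings <- tibble(
  room_id = 1:4,
  city    = c("Brussel", "Oostende", "Gent", "Durbuy")
)
population <- tibble(
  place      = c("Brussels", "Ostend", "Gent"),
  population = c(1000, 200, 300)   # made-up numbers
)

# Which listing cities have no match yet?
unmatched <- anti_join(listings, population, by = c("city" = "place")) %>%
  distinct(city)
unmatched

# Harmonise spellings with a lookup vector, then join
fixes <- c(Brussels = "Brussel", Ostend = "Oostende")
population <- population %>%
  mutate(place = ifelse(place %in% names(fixes), unname(fixes[place]), place))

merged <- left_join(listings, population, by = c("city" = "place"))
merged
```

A left join keeps every listing; cities that are still missing from the population table simply get NA.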

Independent-samples t-test

# First flag cities with more than 100,000 inhabitants as large
airbnb <- airbnb.merged %>%
  mutate(large = population > 100000,
         size = factor(large, labels = c('small', 'large')))

airbnb %>%
  group_by(size, city) %>%
  summarise(count = n(),
            population = mean(population)) %>%
  arrange(desc(size), desc(population)) %>%
  dplyr::filter(!is.na(population)) %>%
  print(n = Inf)

airbnb.cities <- airbnb %>%
  dplyr::filter(!is.na(population))

airbnb.cities %>%
  group_by(size) %>%
  summarise(mean_price = mean(price),
            sd_price = sd(price),
            count = n())

size   mean_price   sd_price  count
small   110.31265  121.63090   4270
large    85.41809   82.46392  11696

# We want to test whether prices are the same in large and small cities.
library(car)

# First test whether the two samples have equal variances
leveneTest(airbnb.cities$price, airbnb.cities$size)
# Result: p < 2.2e-16, so the null hypothesis of equal variances is rejected and we run the unequal-variance t-test below.

# t-test with unequal variances (Welch)
t.test(airbnb.cities$price ~ airbnb.cities$size,
       var.equal = FALSE)
# p < 2.2e-16: the null hypothesis of equal means is rejected.
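As a sanity check, the Welch statistic can be recomputed by hand from the group summaries printed above. This is a sketch that takes the rounded means, standard deviations and counts from the table as its only inputs, so it should agree with t.test() only up to rounding.

```r
# Group summaries reported in the post
m <- c(small = 110.31265, large = 85.41809)   # mean price
s <- c(small = 121.63090, large = 82.46392)   # standard deviation
n <- c(small = 4270,      large = 11696)      # number of listings

se2 <- s^2 / n                                # squared standard error per group
t_welch <- unname((m["small"] - m["large"]) / sqrt(sum(se2)))

# Welch-Satterthwaite approximation to the degrees of freedom
df_welch <- sum(se2)^2 / sum(se2^2 / (n - 1))

t_welch                                       # roughly 12.4
df_welch
p_value <- 2 * pt(-abs(t_welch), df = df_welch)
p_value                                       # far below 2.2e-16
```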

One-way ANOVA

# Suppose we want to test whether mean prices differ between entire homes/apartments, private rooms and shared rooms:
(airbnb.summary <- airbnb %>%
  group_by(room_type) %>%
  summarise(
    count = n(),
    mean_price = mean(price),
    sd_price = sd(price)
  ))

room_type        count  mean_price   sd_price
Entire home/apt  11082   113.40615  117.63628
Private room      6416    64.29099   46.46081
Shared room        153    49.61438   33.94716

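Comparing with a plot

airbnb.summary %>%
  ggplot(aes(room_type, mean_price, fill = room_type)) +
  geom_bar(stat = 'identity', position = 'dodge') +
  scale_fill_brewer('Room type', palette = 'Set1')

Entire homes/apartments have the highest mean price. The summary table also shows that they are the most numerous (11,082 listings, about 1.7 times the 6,416 private rooms) and have the largest standard deviation.
ANOVA tests whether the mean price differs significantly between room types. Before running it, we need to check whether the sample meets the assumptions of ANOVA.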

Assumption 1: normality of residuals

# First look at a plot:
ggplot(data = airbnb,
       aes(price)) +
  facet_wrap(~ room_type) +
  geom_histogram()

The price distribution is clearly right-skewed.
We can also run the Shapiro-Wilk normality test.
First for shared rooms:

airbnb.shared <- airbnb %>%
  dplyr::filter(room_type == 'Shared room')

airbnb.shared$price %>% shapiro.test()

# Result

	Shapiro-Wilk normality test

data:  .
W = 0.83948, p-value = 1.181e-11

The p-value is very small, so the null hypothesis of normality is rejected.

Next, private rooms:

airbnb.private <- airbnb %>%
  dplyr::filter(room_type == 'Private room')

shapiro.test(airbnb.private$price)

Error in shapiro.test(airbnb.private$price) : sample size must be between 3 and 5000
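The error comes from a hard limit in shapiro.test(), which only accepts between 3 and 5000 values. The sketch below reproduces it on simulated right-skewed data and shows one possible workaround, testing a random subsample; this workaround is an assumption of mine, not the route taken in the book.

```r
set.seed(42)
x <- rexp(6000)   # simulated right-skewed data, more than 5000 values

# shapiro.test() refuses samples larger than 5000; capture the error message
res <- tryCatch(shapiro.test(x), error = function(e) conditionMessage(e))
res

# A random subsample of at most 5000 values can still be tested
sub <- sample(x, 5000)
p_sub <- shapiro.test(sub)$p.value
p_sub
```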

The error says the sample is too large for shapiro.test(). Instead we can use the Anderson-Darling test from the nortest package:

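> library(nortest)
> ad.test(airbnb.private$price)

	Anderson-Darling normality test

data:  airbnb.private$price
A = 372.05, p-value < 2.2e-16

The test rejects the null hypothesis of normality. One way to deal with non-normal data is to take logarithms; for example, here is the distribution of the log-transformed price: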

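airbnb %>%
  ggplot(aes(
    log(price, base = exp(1))
  )) +
  facet_wrap(~ room_type) +
  geom_histogram()

The distribution now looks much closer to normal.
If you actually test it, though, it still fails the normality test. For simplicity we ignore this problem here and keep using the untransformed price.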

> ad.test(log(airbnb.private$price, base = exp(1)))

	Anderson-Darling normality test

data:  log(airbnb.private$price, base = exp(1))
A = 31.517, p-value < 2.2e-16

Assumption 2: homogeneity of variance

First draw a box plot to inspect the spread:

airbnb %>%
  ggplot(aes(room_type, price)) +
  geom_boxplot(aes(fill = room_type))

airbnb %>%
  group_by(room_type) %>%
  summarise(count = n(),
            mean_price = mean(price),
            sd_price = sd(price)) %>%
  knitr::kable()

room_type        count  mean_price   sd_price
Entire home/apt  11082   113.40615  117.63628
Private room      6416    64.29099   46.46081
Shared room        153    49.61438   33.94716

Test for homogeneity of variance:

> leveneTest(airbnb$price, airbnb$room_type)
Levene's Test for Homogeneity of Variance (center = median)
         Df F value    Pr(>F)
group     2  140.07 < 2.2e-16 ***
      17648
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

The test rejects the hypothesis of equal variances. Again, we ignore this problem.
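For readers who prefer not to ignore the unequal variances, base R offers oneway.test() with Welch's correction. This is an alternative that the post does not use; the sketch runs on simulated prices whose means and spreads loosely mimic the summary table, not on the real data.

```r
set.seed(1)
# Simulated prices with unequal spread per room type (not the real data)
toy <- data.frame(
  room_type = rep(c("Entire home/apt", "Private room", "Shared room"),
                  times = c(200, 120, 30)),
  price = c(rnorm(200, mean = 113, sd = 118),
            rnorm(120, mean = 64,  sd = 46),
            rnorm(30,  mean = 50,  sd = 34))
)

# var.equal = FALSE requests Welch's correction for unequal variances
welch <- oneway.test(price ~ room_type, data = toy, var.equal = FALSE)
welch
```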

ANOVA

# devtools::install_github('samuelfranssens/type3anova')
library(type3anova)

# First fit a linear model
linearmodel <- lm(price ~ room_type, data = airbnb)
# Then run the analysis of variance
type3anova(linearmodel)

term                ss    df1    df2         f  pvalue
(Intercept)    7618725      1  17648  803.3665       0
room_type     10120155      2  17648  533.5666       0
Residuals    167364763  17648  17648        NA      NA
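The F value for room_type is the ratio of two mean squares, each a sum of squares divided by its degrees of freedom. Recomputing it from the numbers in the table is a quick way to read the output; the variable names below are mine.

```r
ss_room  <- 10120155    # sum of squares for room_type
ss_resid <- 167364763   # residual sum of squares
df1 <- 2                # 3 room types - 1
df2 <- 17648            # 17651 listings - 3 room types

ms_room  <- ss_room / df1
ms_resid <- ss_resid / df2
f_stat   <- ms_room / ms_resid
f_stat                  # about 533.57, matching the table

pf(f_stat, df1, df2, lower.tail = FALSE)  # upper-tail p-value
```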

Tukey’s honest significant difference test

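The null hypothesis of ANOVA is that all group means are equal. Rejecting it means that at least two group means differ. To find out which pairs differ, we can run Tukey's test, which provides all pairwise comparisons.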

                                  diff       lwr        upr     p adj
Private room-Entire home/apt -49.11516 -52.69593 -45.534395  0.000000
Shared room-Entire home/apt  -63.79178 -82.37217 -45.211381  0.000000
Shared room-Private room     -14.67661 -33.34879   3.995562  0.155939
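The post shows only the Tukey output, not the call that produced it. One common way to get a table of this shape in base R is TukeyHSD() on an aov() fit; whether the author used exactly this call is an assumption. The sketch runs on simulated prices, not the real listings.

```r
set.seed(1)
# Simulated stand-in for the listings (not the real data)
toy <- data.frame(
  room_type = factor(rep(c("Entire home/apt", "Private room", "Shared room"),
                         times = c(200, 120, 30))),
  price = c(rnorm(200, 113, 118), rnorm(120, 64, 46), rnorm(30, 50, 34))
)

fit   <- aov(price ~ room_type, data = toy)
tukey <- TukeyHSD(fit, "room_type")
tukey$room_type   # one row per pair; columns: diff, lwr, upr, p adj
```

A pair differs significantly at the 5% level when its interval from lwr to upr excludes zero, or equivalently when p adj is below 0.05.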

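From the Tukey table above, entire homes/apartments differ significantly in price from both private rooms and shared rooms. The difference between shared rooms and private rooms is not significant (adjusted p = 0.156, and the confidence interval includes zero).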

程振兴 @czxa.top