我的豆瓣观影报告

我的豆瓣观影报告

截止到昨天,我的豆瓣电影上终于收录了 500 部电影+电视剧,所以想爬下来研究研究。

爬取比较容易,我的习惯是先分析 xpath 的规律,然后再通过构造循环把所需要的信息都存放到数据框中。

电影数据爬取

首先把这 500 部电影爬下来,中间不知道怎么少了 4 部,就不管它们了。

R
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
library(rvest)
library(tidyverse)
library(progress)
start = 0
mov <- data.frame(
title = NA,
rating = NA,
href = NA,
img = NA,
collectdate = NA
)
library(progress)
pb <- progress_bar$new(total = 34)
for(i in 1:34){
pb$tick(0)
pb$tick()
html <- read_html(paste0('https://movie.douban.com/people/132624622/collect?start=', start, '&sort=time&rating=all&filter=all&mode=grid'))
for(n in 1:15){
mov <- rbind(mov, data.frame(
title = html %>% html_node(xpath = paste0('//*[@id="content"]/div[2]/div[1]/div[2]/div[', n, ']/div[2]/ul/li[1]/a')) %>% html_text() %>% str_replace_all(pattern = " ", replacement = "") %>% str_replace_all(pattern = "\n", replacement = "") %>% str_split("/") %>% .[[1]] %>% .[1],
rating = html %>% html_node(xpath = paste0('//*[@id="content"]/div[2]/div[1]/div[2]/div[', n, ']/div[2]/ul/li[3]/span[1]')) %>% html_attr('class') %>% str_extract("[0-9]"),
href = html %>% html_node(xpath = paste0('//*[@id="content"]/div[2]/div[1]/div[2]/div[', n, ']/div[2]/ul/li[1]/a')) %>% html_attr("href"),
img = html %>% html_node(xpath = paste0('//*[@id="content"]/div[2]/div[1]/div[2]/div[', n, ']/div[1]/a/img')) %>% html_attr('src'),
collectdate = html %>% html_node(xpath = paste0('//*[@id="content"]/div[2]/div[1]/div[2]/div[', n, ']/div[2]/ul/li[3]/span[2]')) %>% html_text()
))
}
start = start + 15
}

df <- mov %>%
filter(!is.na(href)) %>%
mutate(
collectdate = lubridate::ymd(collectdate),
rating = as.numeric(rating)
)

# write_rds(df, "douban.rds")

爬取结果:douban.rds

先进行一些简单的分析:

简单分析

观影评分分布

R
1
2
3
4
5
6
7
8
9
10
df <- read_rds("douban.rds")
df %>%
count(rating) %>%
ggplot(aes(x = rating, y = n)) +
geom_col(aes(fill = factor(rating))) +
scale_fill_brewer(name = "评分", palette = 'Set2') +
labs(x = "我对该电影的评分",
y = "电影数量",
title = "我的观影评分分布",
caption = "数据来源:豆瓣电影\nhttps://movie.douban.com/people/132624622/collect")

我是一个非常宽容的人,一般都是四颗星、五颗星的给,除非有些特别烂的电影,例如这些:

R
1
2
3
4
5
df %>% 
select(title, rating) %>%
arrange(rating) %>%
slice(1:10) %>%
knitr::kable(align = "c")
电影名称 我的评分(星星数量)
曼蒂 1
第三波 1
夺命双头鲨 1
守护者:世纪战元 1
第一缕曙光 2
冥王星时刻 2
惊变 28 天 2
猛虫过江 2
地狱男爵 2:黄金军团 2
尸油 3D 2

收录时间分布

R
1
2
3
4
5
6
7
8
9
10
df %>% 
count(collectdate) %>%
ggplot(aes(x = collectdate, y = n)) +
geom_line() +
scale_x_date(breaks = scales::date_breaks(width = '2 month'),
labels = scales::date_format()) +
labs(x = "我把该电影收录豆瓣的时间",
y = "电影数量",
caption = "数据来源:豆瓣电影\nhttps://movie.douban.com/people/132624622/collect",
title = "我的豆瓣电影收录")

去年 6 月 2 号的那个晚上闲着没事收录了将近两百部电影,后来基本都是看一部收录一部。

更为细致的爬取

为了爬取更多的数据,我们需要进入每个电影的主页爬取数据。

R
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
subject <- data.frame(
href = NA,
director = NA,
screenwrite = NA,
stars = NA,
genre = NA,
makecountry = NA,
language = NA,
releasedate = NA,
duration = NA,
othernames = NA,
doubanrating = NA,
doubanratingpeoplenum = NA
)

pb <- progress_bar$new(total = nrow(df))
for(i in 1:nrow(df)){
pb$tick(0)
pb$tick()
html <- read_html(df$href[i])
info <- html %>%
html_node(xpath = '//*[@id="info"]') %>%
html_text() %>%
str_replace_all("\\s", "") %>%
.[[1]]
subject <- rbind(
subject,
data.frame(
href = df$href[i],
director = info %>%
str_match("导演:(.*)编剧:") %>%
.[2],
screenwrite = info %>%
str_match("编剧:(.*)主演:") %>%
.[2],
stars = info %>%
str_match("主演:(.*)类型:") %>%
.[2],
genre = info %>%
str_match("类型:(.*)制片国家/地区:") %>%
.[2],
makecountry = info %>%
str_match("制片国家/地区:(.*)语言:") %>%
.[2],
language = info %>%
str_match("语言:(.*)上映日期:") %>%
.[2],
releasedate = info %>%
str_match("上映日期:(.*)片长:") %>%
.[2],
duration = info %>%
str_match("片长:(.*)又名:") %>%
.[2],
othernames = info %>%
str_match("又名:(.*)IMDb链接:") %>%
.[2],
doubanrating = html %>%
html_node(xpath = '//*[@id="interest_sectl"]/div[1]/div[2]/strong') %>%
html_text(),
doubanratingpeoplenum = html %>%
html_node(xpath = '//*[@id="interest_sectl"]/div[1]/div[2]/div/div[2]/a') %>%
html_text()
)
)
}

subject <- subject %>%
filter(!is.na(href))

# write_rds(subject, "subject.rds")

爬取结果:subject.rds

然后把这个结果和刚刚的那个进行合并:

R
1
2
3
4
5
6
7
8
9
10
subject <- read_rds("subject.rds")
douban <- df %>%
left_join(subject, by = "href")

douban <- douban %>%
mutate(
genre = gsub("官方网站:(.*)", "", genre)
)

write_rds(douban, "doubanmovies.rds")

最终的结果:doubanmovies.rds

进一步分析

首先读取并简单整理数据:

R
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
library(tidyverse)
library(lubridate)
douban <- read_rds("doubanmovies.rds") %>%
as_tibble() %>%
mutate(
releasedate = str_sub(releasedate, 1, 10),
releasedate = str_replace(releasedate, pattern = "\\(.*\\)", replacement = ""),
releasedate = releasedate %>% as_date(),
duration = str_extract(duration, "[0-9]+分钟") %>%
str_replace("分钟", "") %>%
as.numeric(),
doubanrating = doubanrating %>% as.numeric(),
doubanratingpeoplenum = doubanratingpeoplenum %>%
str_replace("人评价", "") %>%
as.numeric()
)

观影类型分布

R
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
library(tidytext)
colors <- c("#31CF65", "#FC3C63", "#3273DC", '#FFDD57', '#31CF65', '#FC3C63', '#3273DC', '#FFDD57', "#31CF65", "#FC3C63", "#3273DC", '#FFDD57',"#31CF65", "#FC3C63", "#3273DC", '#FFDD57', '#31CF65', '#FC3C63', '#3273DC', '#FFDD57', "#31CF65", "#FC3C63", "#3273DC", '#FFDD57', "#31CF65", "#FC3C63", "#3273DC", '#FFDD57')
douban %>%
unnest_tokens(genre, genre, token = stringr::str_split, pattern = "/") %>%
mutate(
genre = genre %>%
fct_infreq() %>%
fct_rev()
) %>%
ggplot(aes(x = genre)) +
geom_bar(aes(fill = genre)) +
labs(x = "电影类型", y = "数量",
title = "我的观影类型分布",
caption = "数据来源:豆瓣电影\nhttps://movie.douban.com/people/132624622/collect") +
coord_flip() +
scale_fill_manual(values = colors) +
guides(fill = "none")

我还是动作+科幻的最爱呀!

导演的分布

R
1
2
3
4
5
6
7
8
9
10
11
12
13
14
douban %>% 
unnest_tokens(director, director, token = stringr::str_split, pattern = "/") %>%
count(director) %>%
arrange(desc(n)) %>%
filter(!is.na(director)) %>%
slice(1:13) %>%
ggplot(aes(x = director %>% fct_reorder(n), y = n)) +
geom_col(aes(fill = director)) +
guides(fill = "none") +
coord_flip() +
scale_fill_manual(values = colors) +
labs(x = "电影数量", y = "导演的名字",
title = "我的观影导演Top13",
caption = "数据来源:豆瓣电影\nhttps://movie.douban.com/people/132624622/collect")

看到这幅图,我立即就想到这个芝山努一定是拍哆啦 A 梦的!

R
1
2
3
4
douban %>% 
filter(str_detect(director, "芝山努")) %>%
select(title, director, rating) %>%
knitr::kable(align = "c")
title director rating
哆啦 A 梦:大雄和发条都市 芝山努 5
哆啦 A 梦:大雄的宇宙小战争 芝山努 5
哆啦 A 梦:大雄的南海大冒险 芝山努 4
哆啦 A 梦:大雄与动物行星 芝山努 4
哆啦 A 梦:大雄的魔界大冒险 芝山努 4
哆啦 A 梦:大雄的日本诞生 芝山努 5
哆啦 A 梦:大雄与云之国 芝山努 4
哆啦 A 梦:大雄与机器人王国 芝山努 4
哆啦 A 梦:大雄与银河超特急 芝山努 5
哆啦 A 梦:大雄与白金迷宫 芝山努 5
哆啦 A 梦:大雄的阿拉伯之夜 芝山努 5
哆啦 A 梦:大雄的创世日记 芝山努 5
哆啦 A 梦:大雄的海底鬼岩城 芝山努 5
哆啦 A 梦:大雄与梦幻三剑士 芝山努 5
哆啦 A 梦:大雄与风之使者 芝山努 5
哆啦 A 梦:大雄的太阳王传说 芝山努 5
哆啦 A 梦:大雄的猫狗时空传 芝山努 4
大雄的结婚前夜 芝山努/渡边步 4

几乎都都给了 8 分、10 分。我是哆啦 A 梦的粉丝!

编剧的分布

R
1
2
3
4
5
6
7
8
9
10
11
12
13
14
douban %>% 
unnest_tokens(screenwrite, screenwrite, token = stringr::str_split, pattern = "/") %>%
count(screenwrite) %>%
arrange(desc(n)) %>%
filter(!is.na(screenwrite)) %>%
slice(1:10) %>%
ggplot(aes(x = screenwrite %>% fct_reorder(n), y = n)) +
geom_col(aes(fill = screenwrite)) +
guides(fill = "none") +
coord_flip() +
scale_fill_manual(values = colors) +
labs(x = "电影数量", y = "编剧的名字",
title = "我的观影编剧Top10",
caption = "数据来源:豆瓣电影\nhttps://movie.douban.com/people/132624622/collect")

显然排名第一的是写了哆啦 A 梦的藤子·f·不二雄,那么排名第二的斯坦·李写了哪些呢?

R
1
2
3
4
5
6
douban %>% 
unnest_tokens(screenwrite, screenwrite, token = stringr::str_split, pattern = "/") %>%
filter(str_detect(screenwrite, "斯坦·李")) %>%
distinct(screenwrite, title) %>%
select(title, screenwrite) %>%
knitr::kable(align = "c")
title screenwrite
绿巨人浩克 斯坦·李
雷神 3:诸神黄昏 斯坦·李
雷神 2:黑暗世界 斯坦·李
雷神 斯坦·李
钢铁侠 3 斯坦·李
神奇四侠 2 斯坦·李
神奇四侠 斯坦·李
蜘蛛侠 2 斯坦·李
蜘蛛侠:英雄归来 斯坦·李
蜘蛛侠 斯坦·李
神奇四侠 2015 斯坦·李
奇异博士 斯坦·李
超凡蜘蛛侠 2 斯坦·李
复仇者联盟 2:奥创纪元 斯坦·李
黑豹 斯坦·李
蚁人 斯坦·李

主演

R
1
2
3
4
5
6
7
8
9
10
11
12
13
14
douban %>% 
unnest_tokens(stars, stars, token = stringr::str_split, pattern = "/") %>%
count(stars) %>%
arrange(desc(n)) %>%
filter(!is.na(stars)) %>%
slice(1:10) %>%
ggplot(aes(x = stars %>% fct_reorder(n), y = n)) +
geom_col(aes(fill = stars)) +
guides(fill = "none") +
coord_flip() +
scale_fill_manual(values = colors) +
labs(x = "电影数量", y = "主演的名字",
title = "我的观影主演Top10",
caption = "数据来源:豆瓣电影\nhttps://movie.douban.com/people/132624622/collect")

前面的五位应该都是哆啦 A 梦系列电影的主演,那么看看第六位的安迪演了哪些电影:

R
1
2
3
4
5
6
douban %>% 
unnest_tokens(stars, stars, token = stringr::str_split, pattern = "/") %>%
filter(str_detect(stars, "安迪·瑟金斯")) %>%
distinct(stars, title) %>%
select(title, stars) %>%
knitr::kable(align = "c")
title stars
金刚 安迪·瑟金斯
指环王 2:双塔奇兵 安迪·瑟金斯
指环王 1:魔戒再现 安迪·瑟金斯
星球大战 7:原力觉醒 安迪·瑟金斯
猩球崛起 安迪·瑟金斯
霍比特人 1:意外之旅 安迪·瑟金斯
猩球崛起 2:黎明之战 安迪·瑟金斯
复仇者联盟 2:奥创纪元 安迪·瑟金斯
星球大战 8:最后的绝地武士 安迪·瑟金斯
指环王 3:王者无敌 安迪·瑟金斯
猩球崛起 3:终极之战 安迪·瑟金斯
黑豹 安迪·瑟金斯

百度了一下,之所以我对这个人没有印象的原因是他演的角色很多都是“被包装起来的”,例如指环王里面的“咕噜”(👽),猩球崛起里面的“凯撒”(🦍),还有金刚里面的 kong(🦍)。

语言分布

R
1
2
3
4
5
6
7
8
9
10
11
12
13
14
douban %>% 
unnest_tokens(language, language, token = stringr::str_split, pattern = "/") %>%
count(language) %>%
arrange(desc(n)) %>%
filter(!is.na(language)) %>%
slice(1:10) %>%
ggplot(aes(x = language %>% fct_reorder(n), y = n)) +
geom_col(aes(fill = language)) +
guides(fill = "none") +
coord_flip() +
scale_fill_manual(values = colors) +
labs(x = "电影数量", y = "语言",
title = "我的观影语言分布 Top10",
caption = "数据来源:豆瓣电影\nhttps://movie.douban.com/people/132624622/collect")

再来看看不同语言的评分分布(我的评分):

R
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
douban %>% 
unnest_tokens(language, language, token = stringr::str_split, pattern = "/") %>%
filter(!is.na(language)) %>%
group_by(language) %>%
summarise(
meanrating = mean(rating, na.rm = T),
n = length(title)
) %>%
arrange(desc(n)) %>%
slice(1:10) %>%
arrange(desc(meanrating)) %>%
mutate(language = fct_reorder2(language, n, meanrating)) %>%
ggplot(aes(x = language, y = meanrating, size = n)) +
geom_point() +
scale_size_continuous("电影数量") +
labs(x = "语言", y = "平均得分",
title = "我对不同语言电影的平均评分",
caption = "数据来源:豆瓣电影\nhttps://movie.douban.com/people/132624622/collect")

发行年份分布

R
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
douban %>% 
mutate(
year = year(releasedate)
) %>%
unnest_tokens(genre, genre, token = stringr::str_split, pattern = "/") %>%
mutate(
genre = fct_lump(genre, n = 4),
genre = genre %>% str_replace("Other", "其它")
) %>%
select(title, year, genre) %>%
count(year, genre) %>%
mutate(
genre = fct_reorder2(genre, year, n)
) %>%
ggplot(aes(x = year, y = n, group = genre, color = genre)) +
geom_line() +
labs(x = "电影数量", y = "发行年份",
title = "我的观影发行年份分布",
caption = "数据来源:豆瓣电影\nhttps://movie.douban.com/people/132624622/collect") +
scale_color_brewer("电影类型", palette = "Paired")

还是 2000 年之后的电影看的比较多,之前的电影拍摄技术较差,我也不喜欢看。

我的评分与豆瓣评分对比

R
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
douban %>% 
mutate(
year = year(releasedate)
) %>%
group_by(year) %>%
summarise(
myrating = mean(rating * 2, na.rm = T),
doubanrating = mean(doubanrating, na.rm = T)
) %>%
filter(year >= 2000) %>%
ggplot() +
geom_col(aes(x = year, y = myrating, fill = factor(year))) +
geom_line(aes(x = year, y = doubanrating)) +
scale_fill_manual(values = colors) +
guides(fill = "none") +
labs(x = "电影平均得分", y = "发行年份",
title = "我的观影评分对比",
subtitle = "线条为豆瓣评分,柱条为我的评分",
caption = "数据来源:豆瓣电影\nhttps://movie.douban.com/people/132624622/collect")

可以看出,没什么关联。

片长分布

R
1
2
3
4
5
6
7
8
9
10
11
12
douban %>% 
filter(!is.na(duration)) %>%
ggplot(aes(x = duration)) +
geom_density() +
labs(x = "片长(分钟)", y = "电影数量",
title = "我的观影片长分布",
caption = "数据来源:豆瓣电影\nhttps://movie.douban.com/people/132624622/collect") +
theme(legend.position = "none") +
geom_text(aes(x = 45, y = 0.0025, label = "电视剧"),
family = 'STSong') +
geom_text(aes(x = 105, y = 0.02, label = "电影"),
family = 'STSong')

还可以看看片长与评分的关系:

首先是片长与我的评分的关系:

R
1
2
3
4
5
6
7
douban %>% 
ggplot(aes(x = duration, y = rating)) +
geom_point(position = 'jitter') +
geom_smooth() +
labs(x = "片长(分钟)", y = "我的评分",
title = "片长 vs. 我的评分",
caption = "数据来源:豆瓣电影\nhttps://movie.douban.com/people/132624622/collect")

然后是片长和豆瓣评分的关系:

R
1
2
3
4
5
6
7
douban %>% 
ggplot(aes(x = duration, y = doubanrating)) +
geom_point(position = 'jitter') +
geom_smooth() +
labs(x = "片长(分钟)", y = "豆瓣评分",
title = "片长 vs. 豆瓣评分",
caption = "数据来源:豆瓣电影\nhttps://movie.douban.com/people/132624622/collect")

似乎是,一部电影的内容越实在、片长越长,它的评分往往越高。

我的观影评分和豆瓣观影评分的关系

R
1
2
3
4
5
6
7
8
9
douban %>% 
ggplot(aes(x = doubanrating, y = rating * 2)) +
geom_point(position = "jitter", alpha = 0.6) +
geom_smooth() +
geom_abline(slope = 1, xintercept = 0, yintercept = 0, color = "#f17c67") +
labs(x = "豆瓣评分", y = "我的评分",
title = "我的观影评分与豆瓣评分的关系",
caption = "数据来源:豆瓣电影\nhttps://movie.douban.com/people/132624622/collect") +
xlim(2, 10) + ylim(2, 10)

橘色线条以下的散点就是我给出的评分低于豆瓣大佬的评分的电影了。

发行日期和豆瓣得分

R
1
2
3
4
5
6
7
8
douban %>% 
ggplot(aes(x = releasedate, y = doubanrating)) +
geom_point(aes(size = doubanratingpeoplenum), alpha = 0.4) +
geom_smooth() +
scale_size_continuous("评分人数") +
labs(x = "发行日期", y = "豆瓣评分",
title = "我的观影——发行日期和豆瓣得分",
caption = "数据来源:豆瓣电影\nhttps://movie.douban.com/people/132624622/collect")

发行日期与我的评分

R
1
2
3
4
5
6
7
8
douban %>% 
ggplot(aes(x = releasedate, y = rating * 2)) +
geom_point(alpha = 0.4, position = "jitter", aes(color = language)) +
geom_smooth() +
labs(x = "发行日期", y = "我的评分",
title = "我的观影——发行日期和我的评分",
caption = "数据来源:豆瓣电影\nhttps://movie.douban.com/people/132624622/collect") +
theme(legend.position = 'none')

这里的颜色是闲着没事加上去的,大概绿色表示英语电影、红色表示华语电影。

2012 年前后烂片不少。

电影词云图

R
1
2
library(d3wordcloud)
d3wordcloud(douban$title, douban$rating)

这个包画的词云不是很好看呢。

R
1
2
3
4
5
6
7
library(rWordCloud)
douban_wordcloud <- douban %>%
select(title, rating) %>%
mutate(rating = 1000^(rating)) %>%
arrange(desc(rating))
d3Cloud(douban_wordcloud$title, douban_wordcloud$rating,
width = "100%", height = "620px")

豆瓣得分 Top20

R
1
2
3
4
5
6
douban %>% 
mutate(img = paste0('<img src="', img, '" width="25%"/>')) %>%
arrange(desc(doubanrating)) %>%
select(title, doubanrating, img) %>%
slice(1:20) %>%
knitr::kable(align = "c")
标题 豆瓣评分 海报
肖申克的救赎 9.6
辛德勒的名单 9.5
机器人总动员 9.3
盗梦空间 9.3
泰坦尼克号 9.3
疯狂动物城 9.2
星际穿越 9.2
大话西游之大圣娶亲 9.2
三傻大闹宝莱坞 9.2
楚门的世界 9.2
蝙蝠侠:黑暗骑士 9.1
末代皇帝 9.1
2112 年哆啦 A 梦诞生 9.1
指环王 3:王者无敌 9.1
指环王 2:双塔奇兵 9.0
死亡诗社 9.0
罗马假日 9.0
少年派的奇幻漂流 9.0
当幸福来敲门 9.0
黑客帝国 8.9
# R

Comments

Your browser is out-of-date!

Update your browser to view this website correctly. Update my browser now

×