mmjpg.com 2018 Year-End Report

Homework I set for myself still has to be finished by myself. This time it's another gallery site of the same kind: 妹子图 (mmjpg.com), which shares high-definition photos daily and also has a mobile version. On this site, however, a photo set's category is only visible on the set's own page, which would make scraping categories too time-consuming, so I skipped the categories this round. As it turns out, 2018 was a bleak year for this site as well.

The scraping approach is the same as before:

R
library(rvest)
# Home page URL: http://www.mmjpg.com/
# Page 1: http://www.mmjpg.com/
# Page 2: http://www.mmjpg.com/home/2

# First, grab the total number of pages
html <- read_html('http://www.mmjpg.com/')
(page_num <- html %>%
  html_nodes(xpath = '/html/body/div[2]/div[1]/div/div/a[8]') %>%
  html_attr('href') %>%
  gsub(pattern = '/home/', replacement = '') %>%
  as.numeric())

# For the first photo set on page 1:
# title and link XPath: /html/body/div[2]/div[1]/ul/li[1]/span[1]/a
# date XPath:           /html/body/div[2]/div[1]/ul/li[1]/span[2]
# views XPath:          /html/body/div[2]/div[1]/ul/li[1]/span[3]

# For the last photo set on page 1:
# title and link XPath: /html/body/div[2]/div[1]/ul/li[15]/span[1]/a
# date XPath:           /html/body/div[2]/div[1]/ul/li[15]/span[2]
# views XPath:          /html/body/div[2]/div[1]/ul/li[15]/span[3]

# So each page holds 15 photo sets, and the XPath pattern is obvious.

# Start scraping. The first page's URL does not follow the same pattern as the
# later pages, so scrape page 1 separately:
df <- data.frame(
  title = html %>% html_node(xpath = '/html/body/div[2]/div[1]/ul/li[1]/span[1]/a') %>% html_text(),
  link  = html %>% html_node(xpath = '/html/body/div[2]/div[1]/ul/li[1]/span[1]/a') %>% html_attr('href'),
  date  = html %>% html_node(xpath = '/html/body/div[2]/div[1]/ul/li[1]/span[2]') %>% html_text(),
  view  = html %>% html_node(xpath = '/html/body/div[2]/div[1]/ul/li[1]/span[3]') %>% html_text()
)

for (i in 2:15) {
  try(
    df <- rbind(
      df,
      data.frame(
        title = html %>% html_node(xpath = paste0('/html/body/div[2]/div[1]/ul/li[', i, ']/span[1]/a')) %>% html_text(),
        link  = html %>% html_node(xpath = paste0('/html/body/div[2]/div[1]/ul/li[', i, ']/span[1]/a')) %>% html_attr('href'),
        date  = html %>% html_node(xpath = paste0('/html/body/div[2]/div[1]/ul/li[', i, ']/span[2]')) %>% html_text(),
        view  = html %>% html_node(xpath = paste0('/html/body/div[2]/div[1]/ul/li[', i, ']/span[3]')) %>% html_text()
      )
    )
  )
}

# Scrape the remaining pages, with a progress bar
library(progress)
pb <- progress_bar$new(total = page_num - 1)
for (m in 2:page_num) {
  pb$tick(0)   # tick(0) draws the bar immediately; tick() then advances it
  pb$tick()
  html <- read_html(paste0('http://www.mmjpg.com/home/', m))
  for (i in 1:15) {
    try(
      df <- rbind(
        df,
        data.frame(
          title = html %>% html_node(xpath = paste0('/html/body/div[2]/div[1]/ul/li[', i, ']/span[1]/a')) %>% html_text(),
          link  = html %>% html_node(xpath = paste0('/html/body/div[2]/div[1]/ul/li[', i, ']/span[1]/a')) %>% html_attr('href'),
          date  = html %>% html_node(xpath = paste0('/html/body/div[2]/div[1]/ul/li[', i, ']/span[2]')) %>% html_text(),
          view  = html %>% html_node(xpath = paste0('/html/body/div[2]/div[1]/ul/li[', i, ']/span[3]')) %>% html_text()
        )
      )
    )
  }
}

# Save the scraped results right away (qwrite/qread come from the io package)
library(io)
# qwrite(df, 'mmjpg.rds')
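
As a side note, the per-index XPath loops above could also be replaced by grabbing all fifteen list items of a page at once and extracting the fields in a vectorised way. A rough, untested sketch under that assumption (the helper name scrape_page is mine, not from the original code):

R
# Hypothetical alternative: pull every <li> on a page in one go and extract
# title/link/date/view from each node, instead of looping over li[1]..li[15].
library(rvest)

scrape_page <- function(url) {
  page  <- read_html(url)
  items <- html_nodes(page, xpath = '/html/body/div[2]/div[1]/ul/li')
  data.frame(
    title = items %>% html_node(xpath = './span[1]/a') %>% html_text(),
    link  = items %>% html_node(xpath = './span[1]/a') %>% html_attr('href'),
    date  = items %>% html_node(xpath = './span[2]') %>% html_text(),
    view  = items %>% html_node(xpath = './span[3]') %>% html_text(),
    stringsAsFactors = FALSE
  )
}

# Usage sketch: bind all pages together
# urls <- c('http://www.mmjpg.com/', paste0('http://www.mmjpg.com/home/', 2:page_num))
# df   <- do.call(rbind, lapply(urls, scrape_page))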

If you'd rather not run the scrape yourself, you can download the resulting rds file directly:

mmjpg.rds

First, tidy the data:

R
# Tidy the data
library(io)     # qread()
library(dplyr)
df <- qread('mmjpg.rds')
df <- subset(df, !is.na(df$title))

# The date variable needs the most care: the site only shows month-day,
# so prepend the year based on where each block of rows falls.
df$date <- c(
  paste0('2019-', df[1:7,]$date),
  paste0('2018-', df[8:372,]$date),
  paste0('2017-', df[373:737,]$date),
  paste0('2016-', df[738:1102,]$date),
  paste0('2015-', df[1103:1587,]$date)
)

# Replace the relative dates ("45 minutes ago", "1 day ago", "2 days ago")
# shown for the newest sets with actual dates, and drop the "posted" suffix.
df$date <- gsub(pattern = '45分钟前', replacement = '01-07', df$date)
df$date <- gsub(pattern = '1天前', replacement = '01-06', df$date)
df$date <- gsub(pattern = '2天前', replacement = '01-05', df$date)
df$date <- gsub(pattern = ' 发布', replacement = '', df$date)

library(lubridate)
df <- df %>%
  mutate(
    date  = ymd(date),
    view  = gsub(pattern = '[浏览()]', replacement = '', view) %>% as.numeric(),  # strip the "浏览(...)" wrapper
    year  = year(date),
    month = month(date),
    week  = wday(date)
  ) %>%
  arrange(desc(view))
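
A quick, optional sanity check of my own (not in the original post) to confirm that no dates failed to parse and that the per-year row counts look plausible:

R
# How many dates failed ymd() parsing, and how many sets per year
sum(is.na(df$date))
table(df$year)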

The ten most-viewed photo sets:

R
knitr::kable(df %>%
  arrange(desc(view)) %>%
  slice(1:10))
title date link
性感甜美颜值妹子丰满白嫩的蜜桃美臀 2017-07-08 http://www.mmjpg.com/mm/1039
臀控必看!美女伊琳硕大圆润的蜜桃翘臀 2016-10-01 http://www.mmjpg.com/mm/759
97 年嫩妹子木奈奈 34D 丰满美乳写真图片 2017-03-21 http://www.mmjpg.com/mm/930
不看后悔!嫩妹杏子童颜巨乳身材超赞 2017-06-04 http://www.mmjpg.com/mm/1005
太大了!爆乳妹子冰露性感身材让人心痒 2017-04-25 http://www.mmjpg.com/mm/965
清纯学妹一等一的绝美身材性感写真图 2017-07-30 http://www.mmjpg.com/mm/1061
尤物嫩模唐琪儿白嫩翘臀绝对迷晕你 2017-01-03 http://www.mmjpg.com/mm/853
极品 F 杯巨乳妹子桃乃香丰满美臀超性感 2017-03-07 http://www.mmjpg.com/mm/916
巨乳美女歆颜丰满大屁股十分勾人心弦 2017-02-21 http://www.mmjpg.com/mm/902
丝袜控必看!性感妹妹陌子诱惑写真作品 2016-08-27 http://www.mmjpg.com/mm/724

You'll never quite work out why these ten sets are the ones people like best.

Average views per photo set, month by month:

R
library(ggplot2)

month_view <- df %>%
  group_by(year, month) %>%
  summarise(n = sum(view) / length(view)) %>%
  ungroup()

month_view$ym <- paste0(month_view$year, '-', month_view$month, '-01') %>%
  ymd()

# Average views per photo set in each month of 2018
ggplot(data = subset(month_view, month_view$year == 2018)) +
  geom_col(aes(x = factor(month), y = n, fill = factor(month))) +
  scale_fill_brewer(palette = 'Paired') +
  guides(fill = 'none') +
  labs(x = '月份',
       y = '平均浏览量',
       title = '妹子图网站2018年月度平均浏览量',
       subtitle = '基于2018年每个月的总浏览量除以套图量计算',
       caption = '数据来源:妹子图\nhttp://www.mmjpg.com/') +
  scale_x_discrete(
    breaks = 1:12,
    labels = formattable::suffix(1:12, "月份")
  ) +
  scale_y_continuous(
    breaks = seq(0, 600000, by = 200000),
    labels = c('0', '20\n万次', '40\n万次', '60\n万次')
  )
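
Hand-written label vectors like the one above have to be kept in sync with the breaks by hand. As an alternative sketch (assuming scales >= 1.1 is installed; the demo data frame is made up), scales::label_number() can generate the '万次' labels automatically:

R
library(ggplot2)
# Made-up demo data, only to illustrate automatic axis labels
demo <- data.frame(month = factor(1:3), n = c(150000, 320000, 480000))
ggplot(demo) +
  geom_col(aes(x = month, y = n)) +
  # scale = 1e-4 converts raw counts into units of 万 (10,000)
  scale_y_continuous(labels = scales::label_number(scale = 1e-4, suffix = '万次'))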

Looking at total views instead:

R
df %>%
  group_by(year, month) %>%
  summarise(n = sum(view)) %>%
  ungroup() %>%
  subset(year == 2018) %>%
  ggplot() +
  geom_col(aes(x = factor(month), y = n, fill = factor(month))) +
  scale_fill_brewer(palette = 'Paired') +
  guides(fill = 'none') +
  labs(x = '月份',
       y = '总浏览量',
       title = '妹子图网站2018年每个月总和浏览量',
       subtitle = '基于2018年每个月的总浏览量统计',
       caption = '数据来源:妹子图\nhttp://www.mmjpg.com/') +
  scale_x_discrete(
    breaks = 1:12,
    labels = formattable::suffix(1:12, "月份")
  ) +
  scale_y_continuous(
    breaks = seq(0, 20000000, by = 5000000),
    labels = c('0', '500\n万次', '1000\n万次', '1500\n万次', '2000\n万次')
  )

As you can see, this site's traffic is nowhere near that of https://mzitu.com.

Now the monthly average views since 2015:

R
df %>%
  group_by(year, month) %>%
  summarise(n = sum(view) / length(view)) %>%
  ungroup() %>%
  mutate(
    ym = paste0(year, '-', month, '-01') %>% ymd()
  ) %>%
  ggplot() +
  geom_col(aes(x = ym, y = n, fill = factor(year))) +
  scale_fill_brewer('年份', palette = 'Set1') +
  labs(x = '年份',
       y = '平均浏览量',
       title = '妹子图网站2015年以来月度平均浏览量',
       subtitle = '基于每个月的总浏览量/发图量计算',
       caption = '数据来源:妹子图\nhttp://www.mmjpg.com/') +
  scale_y_continuous(
    breaks = seq(0, 4000000, by = 1000000),
    labels = c('0', '100\n万次', '200\n万次', '300\n万次', '400\n万次')
  )

It seems 2018 really was a bleak year for this image site too.

Finally, let's draw a word cloud of the titles:

R
library(jiebaR)
engine_s <- worker(stop_word = "stopwords.txt", user = "dictionary.txt")
# Segment each title separately; collapsing all titles into a single string
# would give every row the same (huge) segment value.
df$segment <- sapply(
  as.character(df$title),
  function(x) paste(segment(x, engine_s), collapse = ', ')
)
library(tidytext)

word_df <- df %>%
  unnest_tokens(word, segment, token = stringr::str_split, pattern = ", ")

# Weight each word by the total views of the sets whose titles contain it
word_count <- word_df %>%
  group_by(word) %>%
  summarise(n = sum(view)) %>%
  ungroup() %>%
  arrange(desc(n)) %>%
  slice(1:100)

library(rWordCloud)
d3Cloud(text = word_count$word, size = word_count$n)
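
If rWordCloud (a GitHub-only package) is hard to install, a similar cloud can be drawn with the CRAN package wordcloud2. A minimal sketch, assuming the word_count data frame built above:

R
# Alternative using the CRAN package wordcloud2 instead of rWordCloud;
# wordcloud2() expects a two-column data frame of words and weights.
library(wordcloud2)
wordcloud2(data = data.frame(word = word_count$word, freq = word_count$n))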

