行政单位名称的分析

行政单位名称的分析

最近微信上有一篇很火的微信推文:我们分析了67万个村名,找到了中国地名的秘密。由网易数独创作,感觉非常有趣,就自己模仿着做了一份。

数据爬取

这个分析最难的部分应该是数据爬取了。数据来源在这个网页:2017年统计用区划代码和城乡划分代码(截止2017年10月31日),这个爬取不是难,而是复杂,打开网页你会发现,首先是一个这样的省份表格:

例如点开安徽省的链接:

是不是已经发现复杂之处了。由于这种层层的网页结构,所以想要一页一页的爬基本是要浪费大半辈子的生命了。所以就有了下面的方法。

首先找一个能够进行网站内容离线下载的工具。我是用了这个:SiteSucker 2.11.8 Mac,相信Windows系统也有类似的软件,这个软件可以干嘛呢?简单的来说,就是可以把整个网站下载下来。当然我们不用下载整个网站,我们只需要下载:

1
http://www.stats.gov.cn/tjsj/tjbz/tjyqhdmhcxhfdm/2017/

下面的包含的部分。实际上整个网站就是一个层层的文件夹,
http://www.stats.gov.cn/
对应的就是根目录。然后
http://www.stats.gov.cn/tjsj/tjbz/tjyqhdmhcxhfdm/2017/
就是根目录文件夹中的子文件夹中的子文件夹···中的子文件夹,一层层的子文件夹,我们需要的就是名叫2017的子文件夹。所以我们把这个文件夹整个拷贝下面就行了。具体设置可以参考下面:

整个任务设置可以保存为一个suck文件:
行政区划爬取任务.suck

设置好任务点击开始下载即可,夜间进行即可,我是昨夜下载完的。

经过一夜的下载,得到了2017文件夹,打开文件夹就会发现这个文件夹结构确实很复杂,所以我们可以把所有的html文件检索出来放到一个文件夹里。直接移动即可,不过需要注意的是,有四万多个html,所以一次性移动很可能会让电脑卡死。

这就是最后所有的html文件了,一共46789个html文件,那么下面该怎么处理呢???一个个处理???

一个个的处理感觉有点不太现实,即使写好了程序,每处理一个文件要1秒,那么总共也要13个小时。所以我的策略是先把这些文件合并为一个文件(因为文件的结果一样,所以批量处理是可行的)。下面所有的操作都是在Stata中进行的。

Stata
1
2
3
4
5
6
cd ~/Desktop/行政区划
!ls > ~/Desktop/filename.txt
infix strL v 1-20000 using "~/Desktop/filename.txt", clear
forval i = 1/`=_N'{
!cat `=v[`i']' >> ~/Desktop/all.html
}

这里用了两个shell命令,第一个是列示当前文件夹里的所有文件的名称,然后使用>将结果导出保存到filename.txt文件中。第二个是cat命令,读取二进制文件的内容,然后写入all.html文件中的下一行。

实际上如果文件较少的话,直接使用
cat *.html >> ~/Desktop/all.html
就能实现多个文件的合并了。但是这里文件很多,所以这样是会出错的。

大约1个小时就能合并完成了,然后我们再读入stata处理一下:

Stata
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
cd ~/Desktop
* 转码
utrans all.html
infix strL v 1-20000 using all.html, clear
keep if index(v, "<tr class=") & !index(v, "head")
split v, parse("</tr>")
drop v
save temp.dta, replace
clear all
gen v = ""
save all, replace
use temp, clear
foreach i of varlist _all{
preserve
qui{
keep `i'
ren `i' v
drop if v == ""
append using all
save all, replace
}
di "`i'"
restore
}
use all, clear
drop if !index(v, "tr")
replace v = subinstr(v, "<br/>", "", .)
gen id = ustrregexs(1) if ustrregexm(v, `"<td>(.*)</td><td>(.*)</td><td>(.*)</td>"')
gen name = ustrregexs(3) if ustrregexm(v, `"<td>(.*)</td><td>(.*)</td><td>(.*)</td>"')
replace id = ustrregexs(3) if ustrregexm(v, `"<tr class='(.*)'><td><a href='(.*)'>(.*)</a></td><td><a href='(.*)'>(.*)</a></td>"') & id == ""
replace name = ustrregexs(5) if ustrregexm(v, `"<tr class='(.*)'><td><a href='(.*)'>(.*)</a></td><td><a href='(.*)'>(.*)</a></td>"') & name == ""
replace id = ustrregexs(2) if ustrregexm(v, `"<tr class='(.*)'><td>(.*)</td><td>(.*)</td>"') & id == ""
replace name = ustrregexs(3) if ustrregexm(v, `"<tr class='(.*)'><td>(.*)</td><td>(.*)</td>"') & name == ""
replace id = ustrregexs(2) if ustrregexm(v, `"<tr class='(.*)'><td>(.*)</td>"') & id == ""
compress
destring id, replace force
drop if index(v, "provincetr")
drop if v == "<tr class='provincetr'><td></td><td></td>"
replace name = "谷脚村" if id == 430524108245
replace name = "石阳桥村" if id == 430524103230
replace name = "洞头印村" if id == 430524117216
replace name = "章几塘村" if id == 430524111221
gen class = ustrregexs(1) if ustrregexm(v, `"<tr class='(.*)'><"')
gen cid = ustrregexs(1) if ustrregexm(v, `"</td><td>(.*)</td><td>"')
drop v
format name %20s
format class %10s
destring cid, replace force
format id %12.0f
keep id name
label var id "统计用区划代码"
label var name "名称"
label var cid "城乡分类代码"
label var class "行政区划级别"
replace class = subinstr(class, "tr", "", .)
replace class = "市" if class == "city"
replace class = "县" if class == "county"
replace class = "乡镇" if class == "town"
replace class = "村/居委会" if class == "village"
save 中国行政区划, replace

在html文件中,每个表格的源代码在一行,所以我先是按照html换行符split,然后在通过每次保留一个变量再append将所有的变量拼接成一个变量,再然后使用正则表达式提取行政区划代码和名称以及城乡分类代码。由于这个结果文件非常大(232.6M),所以我不能在这里附上链接,不过如果你需要,可以留言索要。

cid是城乡分类代码,具体含义可以在百度百科中查到:统计用区划代码和统计用城乡划分代码

统计用区划代码 城乡分类代码
111 主城区
112 城乡结合区
121 镇中心区
122 镇乡结合区
123 特殊区域
210 乡中心区
220 村庄

城乡分类代码第13位为1,表示城镇;第13位为2,表示乡村。

统计区划代码一共12位,各层代码由左起表示:

级别
1、2位 省级码段
3、4位 地级码段
5、6位 县级码段
7~9位 乡级码段
10~12位 村级码段

再然后为了绘制推文里面的地理分布图,我们需要每个村的的经纬度,但是这里有72万个地点,基本上不可能一个个查询经纬度了,所以只能退而求其次,用每个村所在的县级行政单位作为其地理坐标。我的cuse数据库里面有一个各个县的地理坐标 及统计区划代码的数据集:countycode.dta。下面将上面的那个数据处理一下和这个数据集合并即可:

Stata
1
2
3
4
5
6
7
8
9
10
cuse countycode, clear
save countycode, replace
use 中国行政区划, clear
gen county_code = strofreal(id, "%12.0f")
replace county_code = substr(county_code, 1, 6)
destring county_code, replace
merge m:1 county_code using countycode
keep if _m == 3
drop _m
save 中国行政区划+经纬度, replace

同样,因为这个数据集过大,所以我不能直接在这里挂链接。

现在基本数据OK了,下面可以进行分析了。

分析

数据基本情况

Stata
1
2
3
4
5
6
7
8
9
10
11
12
. use 中国行政区划, clear

. tab class

行政区划级别 | Freq. Percent Cum.
----------------+-----------------------------------
乡镇 | 43,523 6.04 6.04
县 | 3,287 0.46 6.50
市 | 343 0.05 6.55
村/居委会 | 673,028 93.45 100.00
----------------+-----------------------------------
Total | 720,181 100.00

67万个行政村!结果和推文里的数目基本一样。需要注意的是这个是行政村而非自然村。一个行政村里面可能包含多个自然村。

重名的村子

首先把村子的“纯名字”提取出来:

Stata
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
use 中国行政区划, clear
replace name = subinstr(name, "村委会", "", .)
replace name = subinstr(name, "村民委员会", "", .)
replace name = subinstr(name, "居委会", "", .)
replace name = subinstr(name, "居民委员会", "", .)
replace name = subinstr(name, "社区", "", .)
replace name = subinstr(name, "街道办事处", "", .)
replace name = subinstr(name, "村", "", .)
* 一般村庄的名字都是前两个字,所以还是只保留前两个字,后面的内容应该算是“姓”
* replace name = substr(name, 1, 6)
contract name
gsort -_freq
keep in 1/500
ren name name
ren _freq freq
save 中国完全同名村庄排行榜, replace

Stata的绘图功能不如R的完善,所以下面用R的ggplot2绘图。

首先绘制一幅Top10的条形图:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
library(readstata13)
df1 <- read.dta13("中国完全同名村庄排行榜.dta")
df1$name <- factor(df1$name, levels = df1[order(df1$freq),]$name)
library(ggplot2)
library(cowplot)
theme_set(theme_bw(base_size = 20, base_family = "STSong") +
theme(plot.title = element_text(hjust = 0.1,
family = "STSongti-SC-Bold")))
p1 <- ggplot(df1[1:10, ]) +
geom_bar(aes(x = name, y = freq, fill = name), stat = "identity") +
geom_text(aes(x = name, y = freq - 30, label = freq), colour = "white", size = 8, family = "STSong") +
geom_text(aes(x = name, y = 900, label = paste("| ", name), colour = name), family = 'STSongti-SC-Bold') +
coord_flip() +
theme(axis.title.x = element_blank(),
axis.title.y = element_blank(),
axis.text.y = element_blank()) +
scale_fill_brewer(palette = "Paired") +
scale_colour_brewer(palette = "Paired") +
labs(title = "中国完全同名村庄排行榜\n") +
theme_void(base_size = 20, base_family = "STSong") +
theme(plot.title = element_text(hjust = 0.1,
family = "STSongti-SC-Bold")) +
theme(plot.margin = grid::unit(c(0.8, 0.8, 1.8, 0.8), "cm")) +
theme(legend.position = "none")

ggdraw(p1) +
draw_label("单位:个", x = 0.9, y = 0.9, fontfamily = 'STSong', size = 14) +
draw_label("数据来源:国家统计局", x = 0.87, y = 0.05, fontfamily = 'STSong', size = 14) +
draw_image("https://www.czxa.top/images/default28.png",
x = 0.7, y = 0.02, width = 0.06, height = 0.06)

中国完全同名村庄排行榜

不过这个画个词云图不也挺有趣的么?

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
library(ggwordcloud)
df1$name <- as.character(df1$name)
p2 <- ggplot(df1[1:50,], aes(label = name, size = freq, colour = name)) +
geom_text_wordcloud(aes(angle = rep(c(0, 90), 25)), family = 'STSong',
shape = "circle") +
scale_size_area(max_size = 20) +
labs(title = '中国完全同名的村庄') +
theme_void(base_size = 20, base_family = "STSong") +
theme(plot.title = element_text(hjust = 0.1,
family = "STSongti-SC-Bold")) +
theme(plot.margin = grid::unit(c(0.8, 0.8, 1.8, 0.8), "cm"))

ggdraw(p2) +
draw_label("数据来源:国家统计局", x = 0.87, y = 0.05, fontfamily = 'STSong', size = 14) +
draw_image("https://www.czxa.top/images/default28.png",
x = 0.7, y = 0.02, width = 0.06, height = 0.06)

中国完全同名的村庄

村庄的姓氏

首先用Stata整理一下数据:

Stata
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
use 中国行政区划, clear
* 可以通过前两个字分辨是否为姓氏
replace name = substr(name, 1, 6)
replace name = subinstr(name, "家", "", .)
replace name = subinstr(name, "庄", "", .)
replace name = subinstr(name, "村", "", .)
replace name = subinstr(name, "南", "", .)
replace name = subinstr(name, "北", "", .)
replace name = subinstr(name, "西", "", .)
replace name = subinstr(name, "东", "", .)
contract name
gsort -_freq
drop if name == ""
* 删除一些主要不作为姓氏的
drop in 8/10
drop in 6
drop in 10/12
drop in 13/14
drop in 18
keep in 1/20
save 中国村庄的姓氏, replace

然后用R绘图:

R
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
df2 <- read.dta13("中国村庄的姓氏.dta")
df2$name <- factor(df2$name, levels = df2[order(df1$freq),]$name)
p3 <- ggplot(df2[1:10, ]) +
geom_bar(aes(x = name, y = freq, fill = name), stat = "identity") +
geom_text(aes(x = name, y = freq - 150, label = freq), colour = "white", size = 7, family = "STSong") +
geom_text(aes(x = name, y = 4100, label = paste("| ", name), colour = name), family = 'STSongti-SC-Bold') +
coord_flip() +
theme(axis.title.x = element_blank(),
axis.title.y = element_blank(),
axis.text.y = element_blank()) +
scale_fill_brewer(palette = "Paired") +
scale_colour_brewer(palette = "Paired") +
labs(title = "中国村庄的姓氏排行榜\n") +
theme_void(base_size = 20, base_family = "STSong") +
theme(plot.title = element_text(hjust = 0.1,
family = "STSongti-SC-Bold")) +
theme(plot.margin = grid::unit(c(0.8, 0.8, 1.8, 0.8), "cm")) +
theme(legend.position = "none")

ggdraw(p3) +
draw_label("单位:个", x = 0.9, y = 0.9, fontfamily = 'STSong', size = 14) +
draw_label("数据来源:国家统计局", x = 0.87, y = 0.05, fontfamily = 'STSong', size = 14) +
draw_image("https://www.czxa.top/images/default28.png",
x = 0.7, y = 0.02, width = 0.06, height = 0.06)

中国村庄的姓氏排行榜

王姓果然第一大姓,我们家除了我的女朋友、我的姐夫,所以嫁娶过来的都是姓王的。。。

程家村子的分布

下面当然要看一下程家的村子都分布在哪里了,先把程家的村子筛选出来:

Stata
1
2
3
4
use 中国行政区划+经纬度, clear
keep if index(name, "程")
drop if index(name, "工程")
save 程家村子的分布, replace

然后把这些村子的坐标映射到地图上,地图就用之前画的吧:

R
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
df3 <- read.dta13("程家村子的分布.dta")
library(sf)
library(ggspatial)
theme_set(theme_bw(base_size = 20, base_family = "STSong") + theme(plot.title = element_text(hjust = 0.5), plot.subtitle = element_text(hjust = 0.5)))
library(rworldmap)
library(rworldxtra)
world <- getMap(resolution = "high")
world <- st_as_sf(world)
country_name_df <- read.csv("国名中英文对照表.csv")
country_name_df <- data.frame(country_name_df$三位.字母, country_name_df$中国.惯用名)
colnames(country_name_df) <- c("ne_10m_adm", "cname")
rownames(country_name_df) <- NULL
df <- merge(world, country_name_df, by = "ne_10m_adm")
p4 <- ggplot(data = df) +
geom_sf(fill = "antiquewhite1") +
geom_point(data = df3, aes(x = longitude, y = latitude,
colour = province)) +
geom_text(aes(LON, LAT, label = cname),
size = 4, hjust = "left", color = "darkblue",
fontface = "bold", check_overlap = T, family = "STSong") +
annotate(geom = "text", x = 115, y = 15, label = "中国南海",
fontface = "italic", color = "darkblue",
size = 4, family = "STSong") +
coord_sf(xlim = c(140, 60), ylim = c(5, 60), expand = F) +
annotation_scale(location = "bl", width_hint = 0.5) +
annotation_north_arrow(location = "bl", which_north = "true",
pad_x = unit(0.75, "cm"),
pad_y = unit(0.5, "cm"),
style = north_arrow_fancy_orienteering) +
labs(title = "程家村子的分布\n") +
theme(panel.grid.major = element_line(color = gray(0.5),
linetype = "dashed",
size = 0.5),
panel.background = element_rect(fill = "aliceblue")) +
theme(plot.title = element_text(hjust = 0.1,
family = "STSongti-SC-Bold")) +
theme(plot.margin = grid::unit(c(1, 0.3, 1.8, 0.3), "cm")) +
theme(axis.title = element_blank()) +
theme(legend.position = "none")
ggdraw(p4) +
draw_label("数据来源:国家统计局", x = 0.87, y = 0.05, fontfamily = 'STSong', size = 14) +
draw_image("https://www.czxa.top/images/default28.png",
x = 0.7, y = 0.02, width = 0.06, height = 0.06)

程家村子的分布

好吧,如果你遇到一个姓程的人,那你可以这么问他:

你:你是河南的?
我:不是,我是安徽北部的,靠近河南那边。

刘家村子的分布

再看看我女朋友的姓:

Stata
1
2
3
use 中国行政区划+经纬度, clear
keep if index(name, "刘")
save 刘家村子的分布, replace

R
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
df4 <- read.dta13("刘家村子的分布.dta")
p5 <- ggplot(data = df) +
geom_sf(fill = "antiquewhite1") +
geom_point(data = df4, aes(x = longitude, y = latitude,
colour = province)) +
geom_text(aes(LON, LAT, label = cname),
size = 4, hjust = "left", color = "darkblue",
fontface = "bold", check_overlap = T, family = "STSong") +
annotate(geom = "text", x = 115, y = 15, label = "中国南海",
fontface = "italic", color = "darkblue",
size = 4, family = "STSong") +
coord_sf(xlim = c(140, 60), ylim = c(5, 60), expand = F) +
annotation_scale(location = "bl", width_hint = 0.5) +
annotation_north_arrow(location = "bl", which_north = "true",
pad_x = unit(0.75, "cm"),
pad_y = unit(0.5, "cm"),
style = north_arrow_fancy_orienteering) +
labs(title = "刘家村子的分布\n") +
theme(panel.grid.major = element_line(color = gray(0.5),
linetype = "dashed",
size = 0.5),
panel.background = element_rect(fill = "aliceblue")) +
theme(plot.title = element_text(hjust = 0.1,
family = "STSongti-SC-Bold")) +
theme(plot.margin = grid::unit(c(1, 0.3, 1.8, 0.3), "cm")) +
theme(axis.title = element_blank()) +
theme(legend.position = "none")
ggdraw(p5) +
draw_label("数据来源:国家统计局", x = 0.87, y = 0.05, fontfamily = 'STSong', size = 14) +
draw_image("https://www.czxa.top/images/default28.png",
x = 0.7, y = 0.02, width = 0.06, height = 0.06)

刘家村子的分布

大姓果然分布很广!

丁姓的分布

临时插一小节,看看丁文亮的丁家村的分布:

Stata
1
2
3
use 中国行政区划+经纬度, clear
keep if index(name, "丁")
save 丁家村子的分布, replace

R
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
df5 <- read.dta13("丁家村子的分布.dta")
p6 <- ggplot(data = df) +
geom_sf(fill = "antiquewhite1") +
geom_point(data = df5, aes(x = longitude, y = latitude,
colour = province)) +
geom_text(aes(LON, LAT, label = cname),
size = 4, hjust = "left", color = "darkblue",
fontface = "bold", check_overlap = T, family = "STSong") +
annotate(geom = "text", x = 115, y = 15, label = "中国南海",
fontface = "italic", color = "darkblue",
size = 4, family = "STSong") +
coord_sf(xlim = c(140, 60), ylim = c(5, 60), expand = F) +
annotation_scale(location = "bl", width_hint = 0.5) +
annotation_north_arrow(location = "bl", which_north = "true",
pad_x = unit(0.75, "cm"),
pad_y = unit(0.5, "cm"),
style = north_arrow_fancy_orienteering) +
labs(title = "丁家村子的分布\n") +
theme(panel.grid.major = element_line(color = gray(0.5),
linetype = "dashed",
size = 0.5),
panel.background = element_rect(fill = "aliceblue")) +
theme(plot.title = element_text(hjust = 0.1,
family = "STSongti-SC-Bold")) +
theme(plot.margin = grid::unit(c(1, 0.3, 1.8, 0.3), "cm")) +
theme(axis.title = element_blank()) +
theme(legend.position = "none")
ggdraw(p6) +
draw_label("数据来源:国家统计局", x = 0.87, y = 0.05, fontfamily = 'STSong', size = 14) +
draw_image("https://www.czxa.top/images/default28.png",
x = 0.7, y = 0.02, width = 0.06, height = 0.06)

丁家村子的分布

程姓村子的省份分布

这个图是直接使用Stata绘制的:

Stata
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
cuse china_label, clear
replace name = subinstr(name, "省", "", .)
replace name = subinstr(name, "市", "", .)
replace name = subinstr(name, "自治区", "", .)
replace name = subinstr(name, "维吾尔", "", .)
replace name = subinstr(name, "回族", "", .)
replace name = subinstr(name, "壮族", "", .)
replace name = subinstr(name, "特别行政区", "", .)
ren name province
save china_label, replace
use 程家村子的分布, clear
contract province
merge 1:m province using china_label
replace _freq = 0 if _freq == .
spmap _freq using "china_map.dta", ///
id(id) label(label(province) xcoord(x_coord) ycoord(y_coord) size(*0.66)) ///
ti("图:“程家”村子的分布") subti("程振兴") caption("数据来源:中国统计局")

程姓村子的省份分布

红色村名

中国有很多村子很具有时代特征,名字又红又专,这里仿照推文统计了几个红色村民的频数:

Stata
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
use 中国行政区划, clear
keep if index(name, "和平") | ///
index(name, "团结") | ///
index(name, "胜利") | ///
index(name, "朝阳") | ///
index(name, "向阳") | ///
index(name, "红旗") | ///
index(name, "光明") | ///
index(name, "前进") | ///
index(name, "红星") | ///
index(name, "新民")

replace name = "和平" if index(name, "和平")
replace name = "团结" if index(name, "团结")
replace name = "胜利" if index(name, "胜利")
replace name = "朝阳" if index(name, "朝阳")
replace name = "向阳" if index(name, "向阳")
replace name = "红旗" if index(name, "红旗")
replace name = "光明" if index(name, "光明")
replace name = "前进" if index(name, "前进")
replace name = "红星" if index(name, "红星")
replace name = "新民" if index(name, "新民")
contract name
gsort -_freq
ren _freq freq
save 红色村名, replace

R
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
df6 <- read.dta13("红色村名.dta")
df6$name <- factor(df6$name, levels = df6[order(df6$freq),]$name)
(p7 <- ggplot(data = df6) +
geom_bar(aes(x = name, y = freq, fill = name), stat = "identity") +
geom_text(aes(x = name, y = freq - 50, label = freq), colour = "white", size = 7, family = "STSong") +
geom_text(aes(x = name, y = 1200, label = paste("| ", name), colour = name), family = 'STSongti-SC-Bold') +
coord_flip() +
theme(axis.title.x = element_blank(),
axis.title.y = element_blank(),
axis.text.y = element_blank()) +
scale_fill_brewer(palette = "Paired") +
scale_colour_brewer(palette = "Paired") +
labs(title = "红色村名排行榜\n") +
theme_void(base_size = 20, base_family = "STSong") +
theme(plot.title = element_text(hjust = 0.1,
family = "STSongti-SC-Bold")) +
theme(plot.margin = grid::unit(c(0.8, 0.8, 1.8, 0.8), "cm")) +
theme(legend.position = "none"))

ggdraw(p7) +
draw_label("单位:个", x = 0.9, y = 0.9, fontfamily = 'STSong', size = 14) +
draw_label("数据来源:国家统计局", x = 0.87, y = 0.05, fontfamily = 'STSong', size = 14) +
draw_image("https://www.czxa.top/images/default28.png",
x = 0.7, y = 0.02, width = 0.06, height = 0.06)

红色村名排行榜

村子的位置

这个主要是看哪些村子位于沟里,哪些村子又位于坝上。。。

Stata
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
use 中国行政区划, clear
keep if index(name, "山") | ///
index(name, "河") | ///
index(name, "沟") | ///
index(name, "湾") | ///
index(name, "坪") | ///
index(name, "塘") | ///
index(name, "岭") | ///
index(name, "屯") | ///
index(name, "湖") | ///
index(name, "岗") | ///
index(name, "坝") | ///
index(name, "坡") | ///
index(name, "堡") | ///
index(name, "峪")

replace name = "山" if index(name, "山")
replace name = "河" if index(name, "河")
replace name = "沟" if index(name, "沟")
replace name = "湾" if index(name, "湾")
replace name = "坪" if index(name, "坪")
replace name = "塘" if index(name, "塘")
replace name = "岭" if index(name, "岭")
replace name = "屯" if index(name, "屯")
replace name = "湖" if index(name, "湖")
replace name = "岗" if index(name, "岗")
replace name = "坝" if index(name, "坝")
replace name = "坡" if index(name, "坡")
replace name = "堡" if index(name, "堡")
replace name = "峪" if index(name, "峪")
contract name
gsort -_freq
ren _freq freq
save 村庄位置, replace

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
df7 <- read.dta13("村庄位置.dta")
df7$name <- factor(df7$name, levels = df7[order(df7$freq),]$name)
(p8 <- ggplot(data = df7[1:10, ]) +
geom_bar(aes(x = name, y = freq, fill = name), stat = "identity") +
geom_text(aes(x = name, y = freq - 1200, label = freq), colour = "white", size = 7, family = "STSong") +
geom_text(aes(x = name, y = 28000, label = paste("| ", name), colour = name), family = 'STSongti-SC-Bold') +
coord_flip() +
theme(axis.title.x = element_blank(),
axis.title.y = element_blank(),
axis.text.y = element_blank()) +
scale_fill_brewer(palette = "Paired") +
scale_colour_brewer(palette = "Paired") +
labs(title = "你的村子在沟里还是在屯里?\n") +
theme_void(base_size = 20, base_family = "STSong") +
theme(plot.title = element_text(hjust = 0.1,
family = "STSongti-SC-Bold")) +
theme(plot.margin = grid::unit(c(0.8, 0.8, 1.8, 0.8), "cm")) +
theme(legend.position = "none"))

ggdraw(p8) +
draw_label("单位:个", x = 0.9, y = 0.9, fontfamily = 'STSong', size = 14) +
draw_label("数据来源:国家统计局", x = 0.87, y = 0.05, fontfamily = 'STSong', size = 14) +
draw_image("https://www.czxa.top/images/default28.png",
x = 0.7, y = 0.02, width = 0.06, height = 0.06)

忘记放图片了。。。

村庄的颜色

  • 这里存在一个偏误就是一个村庄的名字可能含多种颜色。
    Stata
    1
    2
    3
    4
    5
    6
    7
    8
    9
    10
    11
    12
    13
    14
    15
    16
    17
    18
    19
    20
    21
    22
    23
    24
    25
    26
    27
    28
    29
    30
    31
    32
    33
    34
    35
    36
    37
    38
    use 中国行政区划, clear
    keep if index(name, "红") | ///
    index(name, "黄") | ///
    index(name, "蓝") | ///
    index(name, "白") | ///
    index(name, "黑") | ///
    index(name, "蓝") | ///
    index(name, "绿") | ///
    index(name, "乌") | ///
    index(name, "赤") | ///
    index(name, "紫") | ///
    index(name, "青") | ///
    index(name, "灰") | ///
    index(name, "粉") | ///
    index(name, "棕") | ///
    index(name, "橙") | ///
    index(name, "褐")

    replace name = "红" if index(name, "红")
    replace name = "黄" if index(name, "黄")
    replace name = "蓝" if index(name, "蓝")
    replace name = "白" if index(name, "白")
    replace name = "黑" if index(name, "黑")
    replace name = "蓝" if index(name, "蓝")
    replace name = "绿" if index(name, "绿")
    replace name = "黑" if index(name, "乌")
    replace name = "红" if index(name, "赤")
    replace name = "紫" if index(name, "紫")
    replace name = "青" if index(name, "青")
    replace name = "灰" if index(name, "灰")
    replace name = "粉" if index(name, "粉")
    replace name = "棕" if index(name, "棕")
    replace name = "橙" if index(name, "橙")
    replace name = "褐" if index(name, "褐")
    contract name
    gsort -_freq
    ren _freq freq
    save 村庄的颜色, replace
R
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
df8 <- read.dta13("村庄的颜色.dta")
df8$name <- factor(df8$name, levels = df8[order(df8$freq),]$name)
df8$color <- c(
"#ffff00", "#ffffff", "#ff0000", "#00ffff", "#000000",
"#9900ff", "#00ff00", "#0000ff", "#808080", "#ffc0cb",
"", "", ""
)
(p9 <- ggplot(data = df8[1:10, ]) +
geom_bar(aes(x = name, y = freq, fill = I(color)), stat = "identity") +
geom_text(aes(x = name, y = freq + 300, label = freq, colour = I(color)), size = 7, family = "STSong") +
geom_text(aes(x = name, y = 9000, label = paste("| ", name), colour = I(color)), family = 'STSongti-SC-Bold') +
coord_flip() +
theme(axis.title.x = element_blank(),
axis.title.y = element_blank(),
axis.text.y = element_blank()) +
labs(title = "你的村子是啥色的?\n") +
theme_void(base_size = 20, base_family = "STSong") +
theme(plot.title = element_text(hjust = 0.1,
family = "STSongti-SC-Bold",
colour = "white")) +
theme(plot.margin = grid::unit(c(0.8, 0.8, 1.8, 0.8), "cm")) +
theme(legend.position = "none")) +
theme(panel.background = element_rect(fill = "grey20")) +
theme(plot.background = element_rect(fill = "grey20"))

ggdraw(p9) +
draw_label("单位:个", x = 0.9, y = 0.9, fontfamily = 'STSong', size = 14, colour = "white") +
draw_label("数据来源:国家统计局", x = 0.87, y = 0.05, fontfamily = 'STSong', size = 14, colour = "white") +
draw_image("https://www.czxa.top/images/default28.png",
x = 0.7, y = 0.02, width = 0.06, height = 0.06) +
theme(panel.background = element_rect(fill = "grey20")) +
theme(plot.background = element_rect(fill = "grey20"))

你的村子是啥色的?

五行分布

中国人很讲究这个五行,五行缺啥补啥。 那么村子们都缺些什么呢?

Stata
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
use 中国行政区划, clear
keep if index(name, "金") | ///
index(name, "木") | ///
index(name, "水") | ///
index(name, "火") | ///
index(name, "土")

replace name = "金" if index(name, "金")
replace name = "木" if index(name, "木")
replace name = "水" if index(name, "水")
replace name = "火" if index(name, "火")
replace name = "土" if index(name, "土")
contract name
gsort -_freq
ren _freq freq
save 金木水火土的分布, replace

R
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
df9 <- read.dta13("金木水火土的分布.dta")
df9$name <- factor(df9$name, levels = df9[order(df8$freq),]$name)
df9$color <- c(
"#5e7ce2", "#ebaf60", "#6c3a27", "#91684a", "#e62739"
)
(p9 <- ggplot(data = df9) +
geom_bar(aes(x = name, y = freq, fill = I(color)), stat = "identity") +
geom_text(aes(x = name, y = freq + 400, label = freq, colour = I(color)), size = 7, family = "STSong") +
geom_text(aes(x = name, y = 9500, label = paste("| ", name), colour = I(color)), family = 'STSongti-SC-Bold') +
coord_flip() +
theme(axis.title.x = element_blank(),
axis.title.y = element_blank(),
axis.text.y = element_blank()) +
labs(title = "你的村子里五行缺啥?\n") +
theme_void(base_size = 20, base_family = "STSong") +
theme(plot.title = element_text(hjust = 0.1,
family = "STSongti-SC-Bold",
colour = "white")) +
theme(plot.margin = grid::unit(c(0.8, 0.8, 1.8, 0.8), "cm")) +
theme(legend.position = "none")) +
theme(panel.background = element_rect(fill = "grey60")) +
theme(plot.background = element_rect(fill = "grey60"))

ggdraw(p9) +
draw_label("单位:个", x = 0.9, y = 0.9, fontfamily = 'STSong', size = 14, colour = "white") +
draw_label("数据来源:国家统计局", x = 0.87, y = 0.05, fontfamily = 'STSong', size = 14, colour = "white") +
draw_image("https://www.czxa.top/images/default28.png",
x = 0.7, y = 0.02, width = 0.06, height = 0.06) +
theme(panel.background = element_rect(fill = "grey20")) +
theme(plot.background = element_rect(fill = "grey20"))

你的村子里五行缺啥?

还可以用ggimage包进行图片映射:
五行图片.zip

R
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
library(ggimage)
df9$img <- paste0(df9$name, ".png")
(p9 <- ggplot(data = df9) +
geom_bar(aes(x = name, y = freq, fill = I(color)), stat = "identity") +
geom_text(aes(x = name, y = freq + 400, label = freq, colour = I(color)), size = 7, family = "STSong") +
geom_text(aes(x = name, y = 9500, label = paste("| ", name), colour = I(color)), family = 'STSongti-SC-Bold') +
geom_image(data = df9, aes(x = name, y = -400, image = img)) +
expand_limits(y = -300) +
coord_flip() +
theme(axis.title.x = element_blank(),
axis.title.y = element_blank(),
axis.text.y = element_blank()) +
labs(title = "你的村子里五行缺啥?\n") +
theme_void(base_size = 20, base_family = "STSong") +
theme(plot.title = element_text(hjust = 0.1,
family = "STSongti-SC-Bold",
colour = "white")) +
theme(plot.margin = grid::unit(c(0.8, 0.8, 1.8, 0.8), "cm")) +
theme(legend.position = "none")) +
theme(panel.background = element_rect(fill = "grey60")) +
theme(plot.background = element_rect(fill = "grey60"))

ggdraw(p9) +
draw_label("单位:个", x = 0.9, y = 0.9, fontfamily = 'STSong', size = 14, colour = "white") +
draw_label("数据来源:国家统计局", x = 0.87, y = 0.05, fontfamily = 'STSong', size = 14, colour = "white") +
draw_image("https://www.czxa.top/images/default28.png",
x = 0.7, y = 0.02, width = 0.06, height = 0.06) +
theme(panel.background = element_rect(fill = "grey20")) +
theme(plot.background = element_rect(fill = "grey20"))

你的村子里五行缺啥?

矿产的分布

还有很多村子使用矿产起名。根据原推文的选择,本文也选择了同样的几种矿产:

Stata
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
use 中国行政区划, clear
keep if index(name, "金") | ///
index(name, "石") | ///
index(name, "玉") | ///
index(name, "铁") | ///
index(name, "银") | ///
index(name, "铜") | ///
index(name, "煤") | ///
index(name, "炭") | ///
index(name, "锡") | ///
index(name, "晶") | ///
index(name, "铅") | ///
index(name, "铂")

replace name = "金" if index(name, "金")
replace name = "石" if index(name, "石")
replace name = "玉" if index(name, "玉")
replace name = "铁" if index(name, "铁")
replace name = "银" if index(name, "银")
replace name = "铜" if index(name, "铜")
replace name = "煤" if index(name, "煤")
replace name = "炭" if index(name, "炭")
replace name = "锡" if index(name, "锡")
replace name = "晶" if index(name, "晶")
replace name = "铅" if index(name, "铅")
replace name = "铂" if index(name, "铂")

contract name
gsort -_freq
ren _freq freq
save 矿产分布, replace

R
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
df10 <- read.dta13("矿产分布.dta")
df10$name <- factor(df10$name, levels = df10[order(df10$freq),]$name)
(p10 <- ggplot(data = df10) +
geom_bar(aes(x = name, y = freq, fill = name), stat = "identity") +
geom_text(aes(x = name, y = freq + 700, label = freq, colour = name), size = 6, family = "STSong") +
geom_text(aes(x = name, y = 16000, label = paste("| ", name), colour = name), family = 'STSongti-SC-Bold') +
coord_flip() +
theme(axis.title.x = element_blank(),
axis.title.y = element_blank(),
axis.text.y = element_blank()) +
scale_fill_brewer(palette = "Paired") +
scale_colour_brewer(palette = "Paired") +
labs(title = "有矿村名排行榜\n") +
theme_void(base_size = 20, base_family = "STSong") +
theme(plot.title = element_text(hjust = 0.1,
family = "STSongti-SC-Bold")) +
theme(plot.margin = grid::unit(c(0.8, 0.8, 1.8, 0.8), "cm")) +
theme(legend.position = "none"))

ggdraw(p10) +
draw_label("单位:个", x = 0.9, y = 0.9, fontfamily = 'STSong', size = 14) +
draw_label("数据来源:国家统计局", x = 0.87, y = 0.05, fontfamily = 'STSong', size = 14) +
draw_image("https://www.czxa.top/images/default28.png",
x = 0.7, y = 0.02, width = 0.06, height = 0.06) +
theme(panel.background = element_rect(fill = "grey50")) +
theme(plot.background = element_rect(fill = "grey50"))

忘记放图片了。。。

矿产的地理分布

最后再绘制矿产的地理分布:

Stata
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
use 中国行政区划+经纬度, clear
keep if index(name, "金") | ///
index(name, "石") | ///
index(name, "玉") | ///
index(name, "铁") | ///
index(name, "银") | ///
index(name, "铜") | ///
index(name, "煤") | ///
index(name, "炭") | ///
index(name, "锡") | ///
index(name, "晶") | ///
index(name, "铅") | ///
index(name, "铂")
gen 矿 = "金" if index(name, "金")
replace 矿 = "石" if index(name, "石")
replace 矿 = "玉" if index(name, "玉")
replace 矿 = "铁" if index(name, "铁")
replace 矿 = "银" if index(name, "银")
replace 矿 = "铜" if index(name, "铜")
replace 矿 = "煤" if index(name, "煤")
replace 矿 = "炭" if index(name, "炭")
replace 矿 = "锡" if index(name, "锡")
replace 矿 = "晶" if index(name, "晶")
replace 矿 = "铅" if index(name, "铅")
replace 矿 = "铂" if index(name, "铂")
save 矿产村庄的位置分布, replace

R
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
df11 <- read.dta13("矿产村庄的位置分布.dta")
(p11 <- ggplot(data = df) +
geom_sf(fill = "antiquewhite1") +
geom_point(data = df11, aes(x = longitude, y = latitude,
colour = 矿)) +
geom_text(aes(LON, LAT, label = cname),
size = 4, hjust = "left", color = "darkblue",
fontface = "bold", check_overlap = T, family = "STSong") +
annotate(geom = "text", x = 115, y = 15, label = "中国南海",
fontface = "italic", color = "darkblue",
size = 4, family = "STSong") +
coord_sf(xlim = c(140, 60), ylim = c(5, 60), expand = F) +
annotation_scale(location = "bl", width_hint = 0.5) +
annotation_north_arrow(location = "bl", which_north = "true",
pad_x = unit(0.75, "cm"),
pad_y = unit(0.5, "cm"),
style = north_arrow_fancy_orienteering) +
labs(title = "你的村子有矿么?\n") +
theme(panel.grid.major = element_line(color = gray(0.5),
linetype = "dashed",
size = 0.5),
panel.background = element_rect(fill = "aliceblue")) +
theme(plot.title = element_text(hjust = 0.1,
family = "STSongti-SC-Bold")) +
theme(plot.margin = grid::unit(c(1, 0.3, 1.8, 0.3), "cm")) +
theme(axis.title = element_blank()))

ggdraw(p11) +
draw_label("数据来源:国家统计局", x = 0.87, y = 0.05, fontfamily = 'STSong', size = 14) +
draw_image("https://www.czxa.top/images/default28.png",
x = 0.7, y = 0.02, width = 0.06, height = 0.06)

你的村子有矿么?

终于完成了!打算把绘制中国地图的方法封装一下方便自己下次画地图的时候使用。

# R, Stata

评论

程振兴

程振兴 @czxa.top
截止今天,我已经在本博客上写了658.8k个字了!

Your browser is out-of-date!

Update your browser to view this website correctly. Update my browser now

×