ggplot2:工具箱

ggplot2:工具箱

本文介绍了如何解决绘图中遇到的一些问题。

基本的图形类型

1
2
3
4
5
6
7
8
9
library(ggplot2)
library(ggthemes)
df <- data.frame(
x = c(3, 1, 5),
y = c(2, 4, 6),
label = c("a", "b", "c")
)
p <- ggplot(df, aes(x, y)) + xlab(NULL) + ylab(NULL)
p + geom_point() + labs(title = "geom_point") + theme_igray(base_family = "Times New Roman", base_size = 15)


1
p + geom_bar(stat = "identity") + labs(title = "geom_bar(stat = \"identity\")") + theme_igray(base_family = "Times New Roman", base_size = 15)


1
p + geom_line() + labs(title = "geom_line") + theme_igray(base_family = "Times New Roman", base_size = 15)


1
p + geom_area() + labs(title = "geom_area") + theme_igray(base_family = "Times New Roman", base_size = 15)


1
p + geom_path() + labs(title = "geom_path") + theme_igray(base_family = "Times New Roman", base_size = 15)


1
p + geom_text(aes(label = label)) + labs(title = "geom_text") + theme_igray(base_family = "FiraSans-Regular", base_size = 15)


1
p + geom_tile() + labs(title = "geom_tile") + theme_igray(base_family = "FiraSans-Regular", base_size = 15)


1
p + geom_polygon() + labs(title = "geom_polygon") + theme_igray(base_family = "FiraSans-Regular", base_size = 15)

展示数据分布

freqplot & histogram

1
2
3
4
depth_dist <- ggplot(diamonds, aes(depth)) + xlim(58, 68)
depth_dist +
geom_histogram(aes(y = ..density..), binwidth = 0.1) +
facet_grid(cut ~ .) + theme_igray(base_family = "FiraSans-Regular", base_size = 15)


1
depth_dist + geom_histogram(aes(fill = cut), binwidth = 0.1, position = "fill") + theme_igray(base_family = "FiraSans-Regular", base_size = 15)


1
depth_dist + geom_freqpoly(aes(y = ..density.., colour = cut), binwidth = 0.1) + theme_igray(base_size = 20, base_family = "FiraSans-Regular")

作为几何对象的直方图和频率多边形均使用了stat_bin统计变换,此统计变换生成了两个输出变量count和density。变量count为默认值,因为它的可解释性更好。而变量density基本上相当于count除以count的总数,此变量在我们想要比较不同分布的形状而不是数据的绝对大小的适合更有用。特别地,我们经常使用此变量比较数据中不同大小子集的分布。

boxplot

下面展示了针对类别或连续型变量取条件所得的箱线图:

1
2
library(plyr)
ggplot(diamonds, aes(cut, depth)) + geom_boxplot() + theme_igray(base_size = 20, base_family = "FiraSans-Regular")


1
ggplot(diamonds, aes(carat, depth, group = round_any(carat, 0.1, floor))) + xlim(0, 3) + geom_boxplot() + theme_igray(base_family = "FiraSans-Regular", base_size = 20)

jitter

geom_jitter = position_gitter + geom_point:通过在离散型分布上添加随机噪声以避免遮盖问题,这是一种较为粗糙的方法:

1
ggplot(mpg, aes(class, cty)) + geom_jitter() + theme_igray(base_family = "FiraSans-Regular", base_size = 20)


1
ggplot(mpg, aes(class, drv)) + geom_jitter() + theme_igray(base_family = "FiraSans-Regular", base_size = 20)

density

1
ggplot(diamonds, aes(depth)) + geom_density() + xlim(54, 70) + theme_igray(base_family = "FiraSans-Regular", base_size = 20)


1
ggplot(diamonds, aes(depth, fill = cut, alpha = I(0.2))) + geom_density() + xlim(54, 70) + theme_igray(base_family = "FiraSans-Regular", base_size = 20)

处理遮盖绘制问题

对于小范围的遮盖问题,可以通过绘制更小的点来解决:

1
2
3
df <- data.frame(x = rnorm(2000), y = rnorm(2000))
norm <- ggplot(df, aes(x, y, colour = I("#177E89"))) + theme_igray(base_size = 20, base_family = "FiraSans-Regular")
norm + geom_point()


1
norm + geom_point(shape = 1)


1
norm + geom_point(shape = ".") ## 点的大小为像素级

对于更大的数据集产生的遮盖问题,我们可以使用调整透明度让点呈现透明效果。在R中,可用的最小透明度为1/256。

1
norm + geom_point(alpha = 1/3)

如果数据存在一定的离散性,我们可以通过在点上增加随机扰动来减轻重叠:

1
2
td <- ggplot(diamonds, aes(table, depth, colour = I("#4472CA"))) + xlim(50, 80) + ylim(50, 70) + theme_igray(base_size = 15, base_family = "FiraSans-Regular")
td + geom_point()


1
td + geom_jitter()


1
2
jit <- position_jitter(width = 0.5)
td + geom_jitter(position = jit)


1
td + geom_jitter(position = jit, alpha = 1/10)

受此启发,我们也可以认为遮盖问题是一种二维核密度估计问题,于是又可引申出以下两种方法:

  1. 将点分箱并统计每个箱中点的数量,然后以某种方式可视化这个数量。将图形划分成小的正方形可能会产生分散注意力的视觉假象。Carr(1987)建议使用六边形取代之,这类图形可以使用geom_hexagon来实现,它使用来hexbin包:
    1
    2
    d <- ggplot(diamonds, aes(carat, price)) + xlim(1, 3) + theme_igray(base_size = 20, base_family = "FiraSans-Regular") + theme(legend.position = "none")
    d + stat_bin2d()


1
d + stat_bin2d(bins = 100)


1
d + stat_bin2d(binwidth = c(0.02, 200))


1
d + stat_binhex()


1
d + stat_binhex(bins = 10)


1
d + stat_binhex(binwidth = c(0.02, 200))

  1. 使用stat_density2d作二维密度估计,并将等高线添加到散点图中,或者使用着色瓦片直接展示密度,或使用大小与分布密度成比例的点进行展示:
    1
    2
    d <- ggplot(diamonds, aes(carat, price, colour = I("#9068be"))) + xlim(1, 3) + theme(legend.position = "none") + theme_igray(base_size = 20, base_family = "FiraSans-Regular")
    d + geom_point() + geom_density2d()


1
d + stat_density2d(geom = "point", aes(size = ..density..), contour = F) + scale_size_area() + theme(legend.position = "none")


1
2
d + stat_density2d(geom = "tile", aes(fill = ..density..), contour = F) + theme(legend.position = "none")
last_plot() + scale_fill_gradient(limits = c(1e-5, 8e-4))

绘制地图

添加地图边界可以通过函数borders()来完成。函数的前两个参数指定了要绘制的地图名map以及其中的具体区域region,其余的参数用于控制边界的外观:如边界的颜色colour和线条的粗细size。如果我们想要的是填充颜色的多边形而不是单纯的边界,可以通过设定参数fill来实现。

1
2
3
4
library(maps)
data("us.cities")
big_citys <- subset(us.cities, pop > 500000)
ggplot(big_citys, aes(long, lat)) + borders("state", size = 0.5) + theme_map() + geom_point()


1
2
3
4
5
6
7
8
9
10
11
12
# devtools::install_github("slowkow/ggrepel")
library(ggrepel)
tx_cities <- subset(us.cities, country.etc == "TX")
tx_cities <- subset(tx_cities, pop > 100000)
ggplot(tx_cities, aes(long, lat)) +
geom_point(colour = "black", alpha = 0.5) +
borders("county", "texas", colour = "grey70") +
theme_map(base_size = 20) +
ggtitle(label = "图:德州人口超过10万的城市") +
theme(plot.title = element_text(size = 20, hjust = 0.5)) +
scale_colour_gradient(high = 'red', low = "orange") +
geom_text_repel(aes(long, lat, label = name, colour = pop))

等值线图则相对更难处理一些,自动化的程度也没那么高,原因在于,要将我们数据中的标志符同地图数据中的标志符完全匹配起来是有一定挑战性的。以下实例展示了如何使用map_data()将地图数据转换为数据框,此数据框可以在之后通过merge()操作与我们的数据相融合,最终绘制出等值线图:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
states <- map_data("state")
(arrests <- USArrests)
names(arrests) <- tolower(names(arrests))
arrests$region <- tolower(rownames(USArrests))

choro <- merge(states, arrests, by = "region")
## 由于绘制多边形时涉及顺序问题,且merge破坏了原始顺序,所以需要重新排序
choro <- choro[order(choro$order),]
ggplot(choro, aes(long, lat, group = group, fill = assault)) +
geom_polygon() +
theme_map() +
scale_fill_gradient(high = 'red', low = "orange") +
ggtitle("图:美国各州人身伤害事件的数量", subtitle = "数据来源:map包") +
theme(plot.title = element_text(size = 20, hjust = 0.5), plot.subtitle = element_text(size = 15, hjust = 0.5))


1
2
3
4
5
6
ggplot(choro, aes(long, lat, group = group, fill = assault/murder)) +
geom_polygon() +
theme_map() +
scale_fill_gradient(high = "red", low = "orange") +
ggtitle("图:美国人身伤害和谋杀类案件的比率", subtitle = "数据来源:R包") +
theme(plot.title = element_text(size = 20, hjust = 0.5), plot.subtitle = element_text(size = 15, hjust = 0.5))

如果我们相对数据作进一步处理,函数map_data()也是很有用的。下例中,我们计算了爱荷华州每个郡近似的中心,然后利用这些中心位置数据在地图上对其名称进行标注:

1
2
3
4
5
library(plyr)
ia <- map_data("county", "iowa")
mid_range <- function(x) mean(range(x, na.rm = T))
centres <- ddply(ia, .(subregion), colwise(mid_range, .(lat, long)))
ggplot(ia, aes(long, lat)) + geom_polygon(aes(group = group), fill = NA, colour = "grey60") + geom_text(aes(label = subregion), data = centres, size = 2, angle = 45) + theme_map(base_size = 15)

揭示不确定性

用于展示区间的几何对象,它们有助于不确定性信息的可视化:

变量X类型 仅展示区间 同时展示区间和中间值
连续型 geom_ribbon geom_smooth(stat = “identity”)
离散型 geom_error_bar geom_crossbar
离散型 geom_linerange geom_pointrange

下例拟合了一个双因素含交互效应的回归模型,并且展示了如何提取边际效应和条件效应,以及如何将其可视化:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
d <- subset(diamonds, carat < 2.5 & rbinom(nrow(diamonds), 1, 0.2) == 1)
d$lcarat <- log10(d$carat)
d$lprice <- log10(d$price)
# 剔除整体的线性趋势
detrend <- lm(lprice ~ lcarat, data = d)
d$lprice2 <- resid(detrend)
mod <- lm(lprice2 ~ lcarat * color, data = d)
# install.packages("effects")
library(effects)
effectdf <- function(...){
suppressWarnings(as.data.frame(effect(...)))
}
color <- effectdf("color", mod)
both1 <- effectdf("lcarat:color", mod)

carat <- effectdf("lcarat", mod, default.levels = 50)
both2 <- effectdf("lcarat:color", mod, default.levels = 3)

ggplot(d, aes(lcarat, lprice, colour = color)) +
geom_point() +
theme_igray(base_size = 20, base_family = "FiraSans-Regular") +
scale_color_tableau()


1
2
3
4
ggplot(d, aes(lcarat, lprice2, colour = color)) +
geom_point() +
theme_igray(base_size = 20, base_family = "FiraSans-Regular") +
scale_color_tableau()


1
2
3
4
5
6
fplot <- ggplot(mapping = aes(y = fit, ymin = lower, ymax = upper)) +
ylim(range(both2$lower, both2$upper))
fplot %+% color + aes(x = color) +
geom_point() + geom_errorbar() +
theme_igray(base_size = 20, base_family = "FiraSans-Regular") +
scale_color_tableau()


1
2
3
4
5
6
fplot %+% both2 +
aes(x = color, colour = lcarat, group = interaction(color, lcarat)) +
geom_errorbar() +
geom_line(aes(group = lcarat)) +
scale_colour_gradient() +
theme_igray(base_size = 20, base_family = "FiraSans-Regular")


1
2
3
fplot %+% carat + aes(x = lcarat) + geom_smooth(stat = "identity") +
theme_igray(base_size = 20, base_family = "FiraSans-Regular") +
scale_color_tableau() + ylim(-0.05, 0.05)


1
2
3
4
5
6
ends <- subset(both1, lcarat == max(lcarat))
fplot %+% both1 + aes(x = lcarat, colour = color) +
geom_smooth(stat = "identity") +
geom_text(aes(label = color, x = lcarat + 0.02), ends) +
theme_igray(base_size = 20, base_family = "FiraSans-Regular") +
scale_color_tableau() + theme(legend.position = "none")

注意,在为这类图形添加题注时,我们需要细致地描述其中所含置信区间的本质,并说明观察置信区间之间的重叠是否有意义(当比较不同组时,如果区间没有重叠,则说明差异显著)。即,这些标准误是针对单组的均值是,如果区间没有重叠,则说明差异显著。即,这些标准误是针对单组的均值的,还是针对不同组件均值之差的。在计算和展示这些标准误时,multcomp包和multcompView包将会非常有用,同时他们在多重比较中能正确地对自由度进行调整。

统计摘要

连续型

中位线曲线

1
2
3
4
5
6
install.packages("ggplot2movies")
library(ggplot2movies)
m <- ggplot(movies, aes(year, rating)) +
theme_igray(base_size = 20, base_family = "FiraSans-Regular") +
scale_color_tableau()
m + stat_summary(fun.y = "median", geom = "line")

median_hilow所得的曲线和平滑带

1
2
3
install.packages("Hmisc")
library(Hmisc)
m + stat_summary(fun.data = "median_hilow", geom = "ribbon", fill = "grey90") + stat_summary(fun.data = "median_hilow", geom = "smooth")

均值曲线

1
m + stat_summary(fun.y = "mean", geom = "line")

mean_cl_boot所得曲线和平滑带

1
m + stat_summary(fun.data = "mean_cl_boot", geom = "ribbon", fill = "grey90") + stat_summary(fun.data = "mean_cl_boot", geom = "smooth")

离散型

均值点

1
2
3
4
m2 <- ggplot(movies, aes(factor(round(rating)), log10(votes))) +
theme_igray(base_size = 20, base_family = "FiraSans-Regular") +
scale_color_tableau()
m2 + stat_summary(fun.y = "mean", geom = "point")

均值点和误差棒

1
m2 + stat_summary(fun.data = "mean_cl_normal", geom = "errorbar")

中位数点和值域

1
m2 + stat_summary(fun.data = "median_hilow", geom = "pointrange")

中位数点和值域条

1
m2 + stat_summary(fun.data = "median_hilow", geom = "crossbar")

使用stat_summary()时,你既可以为每个参数单独指定摘要计算函数,也可以用一个统一的函数对它们进行组合,这两种可选的方式描述如下:

单独的摘要计算函数

参数fun.yfun.yminfun.ymax能够接受简单的数值型摘要计算函数,即该函数能够传入一个数值向量并返回一个数值型结果,如:mean()/median()/min()/max()。

1
2
3
midm <- function(x) mean(x, trim = 0.5)
m2 + stat_summary(aes(colour = "trimmed"), fun.y = midm, geom = "point") +
stat_summary(aes(colour = "raw"), fun.y = mean, geom = "point")

统一的摘要计算函数

fun.data可以支持更复杂的摘要计算函数,例如前面所示的Hmisc包中的函数。我们也可以使用自己编写的摘要计算函数:此函数应返回一个各元素有名称的向量作为输出:

1
2
3
4
5
6
iqr <- function(x, ...){
qs <- quantile(as.numeric(x), c(0.25, 0.75), na.rm = T)
names(qs) <- c("ymin", "ymax")
qs
}
m + stat_summary(fun.data = "iqr", geom = "ribbon")

来自Hmisc包中的摘要计算函数,这些函数用专门的封装,以使它们能够与stat_summary()更轻松地使用:

函数名 Hmisc包中的原名 中间值类型 所计算的区间
mean_cl_normal() smean.cl.boot() 均值 正态渐进所得的标准误
mean_cl_boot() smean.cl.boot() 均值 Bootstrap所得标准误
mean_sdl() smean.sdl() 均值 标准差的倍数
median_hilow() smedian.hilow() 中位数 尾部面积相同的外分位点对

添加图形注解

添加图形注解有两种基本的方式:逐个添加和批量添加。

逐个添加的方式适合少量的、图形属性多样化的注解。我们只要为所想要的图形属性设置好对应的数值就可以了。如果我们需要添加多个具有类似属性的注解,将它们放在数据框里面并一次添加完成也许更有效。

下面的例子中,我们分别使用上面的两种方式,向经济数据中加入有关美国总统的信息。

1
(unemp <- ggplot(economics, aes(date, unemploy)) + geom_line() + xlab("") + ylab("No. unemployed(1000s)") + theme_bw(base_family = "FiraSans-Regular", base_size = 20))


1
2
3
4
presidential <- presidential[-(1:3),]
yrng <- range(economics$unemploy)
xrng <- range(economics$date)
unemp + geom_vline(aes(xintercept = as.numeric(start)), data = presidential)


1
2
library(scales)
unemp + geom_rect(aes(NULL, NULL, xmin = start, xmax = end, fill = party), ymin = yrng[1], ymax = yrng[2], data = presidential, alpha = 0.2)


1
last_plot() + geom_text(aes(x = start, y = (yrng[1] + yrng[2])/1.4, label = name), data = presidential, size = 3, hjust = 0, vjust = 0, angle = 90)


1
2
3
4
# 将句子进行分割,目标是前40个字符作为第一部分(实际要结合单词的边界,不能打破单词)。然后使用\n连接
(caption <- paste(strwrap("Unemployment rates in the US have varied a lot over the years", 40), collapse = "\n"))

unemp + geom_text(aes(x, y, label = caption), data.frame(x = xrng[2], y = yrng[2]), hjust = 1, vjust = 1, size = 4, colour = "darkorange")


1
2
highest <- subset(economics, unemploy == max(unemploy))
unemp + geom_point(data = highest, size = 3, color = "red", alpha = 0.5)

含权数据

1
2
(p <- ggplot(midwest, aes(percwhite, percbelowpoverty)) +
geom_point() + theme_calc(base_size = 20, base_family = "FiraSans-Regular"))


1
p + geom_point(aes(size = poptotal/1e6)) + scale_size_area("Population\n(millions)", breaks = c(0.5, 1, 2, 4))


1
p + geom_point(aes(size = area)) + scale_size_area()

对于更复杂的、涉及到统计变换的情况,我们通过修改weight图形属性来表现权重。这些权重将被传递给统计汇总计算函数。在权重有意义的情况下,各种元素基本都支持权重的设定,例如:各类平滑器、分位数回归、箱线图、直方图以及各种密度图。我们无法直接看到这个权重变量,而且它也没有对应的图例,但它却会改变统计汇总的结果。下图展示来作为权重的人口密度是如何影响来白种人比例和贫困线以下人口比例的关系:

我觉得我要深刻地批注一下:这幅图本质上是没有任何意义的,这里面存在很严重的内生性,白种人普遍较为富有,自然比例越高,贫困线下的人口比例越低。并不能代表因为白种人的比例提高了,贫困问题就解决了。

1
p + geom_smooth(method = "lm", size = 1)


1
p + geom_point(aes(size = popdensity)) + geom_smooth(method = "lm", size = 1, aes(weight = popdensity))

在我们使用总人口作为权重去修改直方图或密度图的时候,我们的视角将从对郡数量分布的观察转向对人口数量(郡的数量乘以人口总数不就是总人口了)分布的观察:

1
qplot(percbelowpoverty, data = midwest, binwidth = 1)


1
qplot(percbelowpoverty, data = midwest, binwidth = 1, weight = poptotal) + ylab("population")

# R

评论

程振兴

程振兴 @czxa.top
截止今天,我已经在本博客上写了607.9k个字了!

Your browser is out-of-date!

Update your browser to view this website correctly. Update my browser now

×