R for Data Science(一)

R for Data Science(一)

和上面一篇博客一样,本文也是一本书的学习笔记,这本书就是大名鼎鼎的:R for Data Science。全书一共三十章,本文涵盖了前12章自觉比较陌生的东西。本书上有一些练习题,答案可以在这里找到:r4ds-exercise-solutions

Data visualisation

首先按照本书需要的包:

R
1
devtools::install_github("hadley/r4ds")
R
1
2
3
library(r4ds)
library(tibble)
library(tidyverse)

查看一个数据框的行列:

R
1
2
3
4
> nrow(mtcars)
[1] 32
> ncol(mtcars)
[1] 11

审视数据框:

R
1
glimpse(mtcars)

其实这里的数据框不是R本身提供的数据框,而是tibble包提供的tibble数据框,更容易使用。

创建一个tibble数据框(只是其中的一种方式):

R
1
2
3
4
5
6
7
8
demo <- tribble(
~cut, ~freq,
"Fair", 1610,
"Good", 4906,
"Very Good", 12082,
"Premium", 13791,
"Ideal", 21551
)

count()函数:

R
1
2
3
4
mpg %>% 
count(class, drv) %>%
ggplot(aes(x = class, y = drv, fill = n)) +
geom_tile()

complete()函数:填充

R
1
2
3
4
5
mpg %>% 
count(class, drv) %>%
complete(class, drv, fill = list(n = 0L)) %>%
ggplot(aes(x = class, y = drv, fill = n)) +
geom_tile()

统计变换

计数条形图:

R
1
2
3
demo %>% 
ggplot() +
geom_bar(aes(x = cut, y = freq), stat = 'identity')

R
1
2
3
diamonds %>% 
ggplot() +
geom_bar(aes(cut, ..count..), stat = 'count')

R
1
2
3
diamonds %>% 
ggplot() +
stat_count(aes(x = cut))

比例条形图:

R
1
2
3
diamonds %>% 
ggplot() +
geom_bar(aes(cut, ..prop.., group = 1))

stat_summary():总结每一个唯一x值的y值:

R
1
2
3
4
5
6
7
8
diamonds %>% 
ggplot() +
stat_summary(
aes(cut, depth),
fun.ymin = min,
fun.ymax = max,
fun.y = median
)

为数据添加扰动:

R
1
2
3
4
5
6
mpg %>% 
ggplot() +
geom_point(
aes(displ, hwy),
position = 'jitter'
)

R
1
2
3
4
5
6
mpg %>% 
ggplot() +
geom_point(
aes(displ, hwy),
position = position_jitter(width = 2)
)

R
1
2
3
mpg %>% 
ggplot(aes(displ, hwy)) +
geom_jitter(width = 20)

R
1
2
3
mpg %>% 
ggplot(aes(displ, hwy)) +
geom_jitter(height = 20)

coord_quickmap():正确设置地图的宽高比例:

R
1
2
3
4
library(patchwork)
nz <- map_data('nz')
ggplot(nz, aes(long, lat, group = group)) +
geom_polygon(fill = "white", colour = 'black') + coord_quickmap()

coord_polar():极坐标变换:

R
1
2
3
4
5
6
7
8
9
10
11
bar <- ggplot(diamonds) + 
geom_bar(
aes(cut, fill = cut),
show.legend = F,
width = 1
) +
theme(aspect.ratio = 1) +
labs(x = NULL, y = NULL)
(bar + coord_flip()) +
(bar + coord_polar()) +
(bar + coord_polar(theta = 'y'))

geom_boxplot()设置positions参数:

R
1
2
3
4
ggplot(data = mpg, aes(x = drv, y = hwy, colour = class)) +
geom_boxplot(position = "identity") +
ggplot(data = mpg, aes(x = drv, y = hwy, colour = class)) +
geom_boxplot(position = 'dodge')

dodge2:默认:

R
1
2
ggplot(data = mpg, aes(x = drv, y = hwy, colour = class)) +
geom_boxplot(position = 'dodge2')

绘制饼图:

R
1
2
3
4
mpg %>% 
ggplot() +
geom_bar(aes(x = factor(1), fill = drv)) +
coord_polar(theta = 'y')

coord_fixed():可以确保geom_abline()生成的线条呈45度角:

R
1
2
3
4
ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
geom_point() +
geom_abline() +
coord_fixed()

数据变换

R
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
library(nycflights13)
flights %>% head()

# A tibble: 6 x 19
year month day dep_time sched_dep_time dep_delay arr_time
<int> <int> <int> <int> <int> <dbl> <int>
1 2013 1 1 517 515 2 830
2 2013 1 1 533 529 4 850
3 2013 1 1 542 540 2 923
4 2013 1 1 544 545 -1 1004
5 2013 1 1 554 600 -6 812
6 2013 1 1 554 558 -4 740
# … with 12 more variables: sched_arr_time <int>, arr_delay <dbl>,
# carrier <chr>, flight <int>, tailnum <chr>, origin <chr>,
# dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
# minute <dbl>, time_hour <dttm>

tibble数据框的变量类型

  • int:整数
  • dbl:双精度实数
  • chr:字符、字符串
  • dttm:日期时间
  • lgl:逻辑变量
  • fctr:因子变量
  • date:日期

dplyr基础知识

filter可以用逗号隔开多个过滤条件:

R
1
2
3
filter(flights,
month == 1,
day == 1)

下面的结果可能是令你惊讶的:

R
1
2
3
sqrt(2)^2 == 2 # FALSE
# 这是因为计算机使用的是有限精度算法,所以你看到的每个数字多是近似值
near(sqrt(2)^2, 2) # TRUE

filter()也支持逻辑运算,例如查找11月12月离开的所有航班:

R
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
> flights %>% 
+ filter(month == 11 | month == 12)
# A tibble: 55,403 x 19
year month day dep_time sched_dep_time dep_delay arr_time
<int> <int> <int> <int> <int> <dbl> <int>
1 2013 11 1 5 2359 6 352
2 2013 11 1 35 2250 105 123
3 2013 11 1 455 500 -5 641
4 2013 11 1 539 545 -6 856
5 2013 11 1 542 545 -3 831
6 2013 11 1 549 600 -11 912
7 2013 11 1 550 600 -10 705
8 2013 11 1 554 600 -6 659
9 2013 11 1 554 600 -6 826
10 2013 11 1 554 600 -6 749
# … with 55,393 more rows, and 12 more variables:
# sched_arr_time <int>, arr_delay <dbl>, carrier <chr>,
# flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
# air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
# time_hour <dttm>

# 或者
nov_dec <- filter(flights, month %in% c(11, 12))

下面的结果是令人惊奇的:

R
1
2
3
4
5
# NA == NA # NA
# NA * 0 # NA
# NA ^ 0 == 1 # TRUE
# Inf * 0 # NaN:非数字
# -Inf * 0 # NaN:非数字

filter()仅包括条件所在行的TRUE,它排除了FALSE和NA,如果你想要保留缺失值,使用:

R
1
2
3
df <- tibble(x = c(1, NA, 3))
filter(df, x > 1)
filter(df, is.na(x) | x > 1)

select()函数支持分号运算符,例如:

R
1
flights %>% select(year:day)

你可以在select()中使用很多辅助函数:

R
1
2
3
4
5
6
7
8
9
10
11
12
13
# 匹配以"y"开头的变量名称
flights %>% select(starts_with('y'))

# 匹配以"r"结尾的变量名称
flights %>% select(ends_with('r'))

# 匹配包含'e'的变量名称
flights %>% select(contains('e'))

# 根据正则表达式匹配:例如匹配以y结尾的变量名称
flights %>% select(matches("(.)y"))

# num_range("x", 1:3):匹配x1、x2和x3

select() + everything()函数可以用来调整变量的顺序,例如把time_hour, air_time两个变量移动到最前面:

R
1
2
flights %>% 
select(time_hour, air_time, everything())

补充知识点:

R
1
2
3
4
5
# 取余函数:%%
c(6, 12, 24, 25) %% 24

# 取整函数:%/%
c(6, 12, 24, 25) %/% 24

将缺失值排列在前面:

R
1
arrange(flights, desc(is.na(dep_time)), dep_time)

通过指定变量的列号也可以选择变量,实际上everything()产生的也是变量列号:

R
1
select(flights, 1, 2, 3)

使用one_of()选择变量:

R
1
2
var <- c("dep_time", "dep_delay", "arr_time", "arr_delay")
select(flights, one_of(var))

contains()辅助函数默认是不区分大小写的,如果想要区分大小写可以设置ignore.case参数为FALSE:

R
1
select(flights, contains("TIME", ignore.case = FALSE))

transmute():生成新变量后,只保留新变量:

R
1
2
3
4
5
6
transmute(
flights,
gain = dep_delay - arr_delay,
hours = air_time / 60,
gain_per_hour = gain / hours
)

lage()和lead():

R
1
2
3
4
5
6
> (x <- 1:10)
[1] 1 2 3 4 5 6 7 8 9 10
> (lag(x))
[1] NA 1 2 3 4 5 6 7 8 9
> (lead(x))
[1] 2 3 4 5 6 7 8 9 10 NA

累加和累积:

R
1
2
3
4
5
> cumsum(x)
[1] 1 3 6 10 15 21 28 36 45 55
> cumprod(x)
[1] 1 2 6 24 120 720 5040 40320
[9] 362880 3628800

累计最小值和累计最大值:

R
1
2
3
4
> cummin(x)
[1] 1 1 1 1 1 1 1 1 1 1
> cummax(x)
[1] 1 2 3 4 5 6 7 8 9 10

累积均值:

R
1
2
> cummean(x)
[1] 1.0 1.5 2.0 2.5 3.0 3.5 4.0 4.5 5.0 5.5

排名函数:

R
1
2
3
4
5
> y <- c(1, 2, 2, NA, 3, 4)
> min_rank(y)
[1] 1 2 2 NA 4 5
> min_rank(desc(y))
[1] 5 3 3 NA 2 1

行号函数(会忽略NA):

R
1
2
> row_number(y)
[1] 1 2 3 NA 4 5

连续排名函数:

R
1
2
> dense_rank(y)
[1] 1 2 2 NA 3 4

百分比排名函数:

R
1
2
> percent_rank(y)
[1] 0.00 0.25 0.25 NA 0.75 1.00

累积分布:

R
1
2
> cume_dist(y)
[1] 0.2 0.6 0.6 NA 0.8 1.0

IQR():内距:

R
1
2
3
4
5
> IQR(y, na.rm = T)
[1] 1
> quantile(y, 0.75, na.rm = T) - quantile(y, 0.25, na.rm = T)
75%
1

计算奇异值的数量:

R
1
2
> n_distinct(y)
[1] 5

分组计数:

R
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
> diamonds %>% 
+ count(cut_width(carat, 0.5))
# A tibble: 11 x 2
`cut_width(carat, 0.5)` n
<fct> <int>
1 [-0.25,0.25] 785
2 (0.25,0.75] 29498
3 (0.75,1.25] 15977
4 (1.25,1.75] 5313
5 (1.75,2.25] 2002
6 (2.25,2.75] 322
7 (2.75,3.25] 32
8 (3.25,3.75] 5
9 (3.75,4.25] 4
10 (4.25,4.75] 1
11 (4.75,5.25] 1

箱线图排序:

R
1
2
3
4
ggplot(mpg) + 
geom_boxplot(
aes(x = reorder(class, hwy, FUN = median), y = hwy)
)

连续变量的箱线图:
等组宽:

R
1
2
3
ggplot(diamonds,
aes(x = carat, y = price)) +
geom_boxplot(aes(group = cut_width(carat, 0.3)))

等数量分组:

R
1
2
3
ggplot(diamonds,
aes(x = carat, y = price)) +
geom_boxplot(aes(group = cut_number(carat, 15)))

从模型中提取数据:

R
1
2
3
4
5
6
7
8
library(modelr)
mod <- lm(log(price) ~ log(carat), data = diamonds)
diamonds2 <- diamonds %>%
add_residuals(mod) %>%
mutate(resid = exp(resid))

ggplot(diamonds2) +
geom_point(aes(carat, resid))

tibbles

创建tibble

R
1
2
3
4
5
6
as_tibble(iris)
tibble(
x = 1:5,
y = 1,
z = x^2 + y
)

tibble不会把字符串变量变成因子变量。

创建tribble的另一种方式是使用tribble函数:

R
1
2
3
4
5
tribble(
~x, ~y, ~z,
"a", 2, 3.6,
"b", 1, 8.5
)

我觉得这种创建方式的好处在于,我可以直接把csv表格复制过来。

tibles有一个精确的打印方法,只显示前10行以及适合屏幕的所有列:

R
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
library(lubridate)
tibble(
a = now() + runif(1e3) * 86400,
b = today() + runif(1e3) * 30,
c = 1:1e3,
d = runif(1e3),
e = sample(letters, 1e3, replace = T)
)

# A tibble: 1,000 x 5
a b c d e
<dttm> <date> <int> <dbl> <chr>
1 2019-01-10 11:10:35 2019-02-04 1 0.763 w
2 2019-01-10 15:38:17 2019-01-24 2 0.477 s
3 2019-01-10 19:02:05 2019-01-11 3 0.129 a
4 2019-01-10 06:37:49 2019-01-31 4 0.452 t
5 2019-01-09 23:06:41 2019-01-17 5 0.555 t
6 2019-01-10 17:46:46 2019-01-14 6 0.495 k
7 2019-01-10 19:09:44 2019-01-22 7 0.250 c
8 2019-01-10 00:46:16 2019-01-23 8 0.207 x
9 2019-01-10 09:24:46 2019-01-10 9 0.465 n
10 2019-01-10 20:27:53 2019-02-07 10 0.320 c
# … with 990 more rows

打印20行、所有列:

R
1
2
flights %>% 
print(n = 20, width = Inf)

使用RStudio的内置数据查看器来获取完整数据集的视图:

R
1
flights %>% View()

子集

R
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
(df <- tibble(
x = runif(5),
y = rnorm(5)
))

> df$x
[1] 0.4529614 0.7015420 0.8720251 0.4934006 0.1010092
> df[1]
# A tibble: 5 x 1
x
<dbl>
1 0.453
2 0.702
3 0.872
4 0.493
5 0.101
> df[[1]]
[1] 0.4529614 0.7015420 0.8720251 0.4934006 0.1010092

要在管道符中使用它们,需要使用特殊占位符.

R
1
2
3
4
> df %>% .$x
[1] 0.4529614 0.7015420 0.8720251 0.4934006 0.1010092
> df %>% .[['x']]
[1] 0.4529614 0.7015420 0.8720251 0.4934006 0.1010092

数据导入

R
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
read_csv(
"a, b, c
1, 2, 3
4, 5, 6"
)

read_csv("The first line of metadata
The second line of metadata
x, y, z
1, 2, 3", skip = 2)

read_csv("1, 2, 3 \n 4, 5, 6", col_names = F)

read_csv("# A comment I want to skip
x, y, z
1, 2, 3", comment = '#')

read_csv("1, 2, 3 \n 4, 5, 6", col_names = c('x', 'y', 'z'))

read_csv("a, b, c \n 1, -2, .", na = '.')

解析向量

R
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
str(parse_logical(c("TRUE", "FALSE", "NA")))

str(parse_integer(c("1", "2", "3")))

str(parse_date(c("2019-01-01",
"1979-10-02")))

parse_integer(c('1', '231', '.', '456'), na = '.')

# 如果解析失败,你会收到警告,并且输出中会丢弃失败的部分:
(x <- parse_integer(c("123",
"345",
"abc",
"123.45")))
# 查看错误:
problems(x)

parse_double("1.23")

parse_double("1,23", locale = locale(decimal_mark = ","))

# parse_number()会忽略数字前后的非数字字符
parse_number("$100")
parse_number("20%")
parse_number("It cost $123.45")
parse_number("$123,456,789")
parse_number("123.456.789",
locale = locale(grouping_mark = '.'))

# 把字符变成16进制:
charToRaw("程振兴")

x1 <- "El Ni\xf1o was particularly bad this year"
x2 <- "\x82\xb1\x82\xf1\x82\xc9\x82\xbf\x82\xcd"

# 猜测x1的编码
guess_encoding(charToRaw(x1))
parse_character(x1, locale = locale(encoding = 'ISO-8859-1'))

guess_encoding(charToRaw(x2)) # 猜错了
parse_character(x2, locale = locale(encoding = 'KOI8-R'))
parse_character(x2, locale = locale(encoding = 'Shift-JIS'))

fruit <- c("apple", "banana")
parse_factor(c("apple", "banana", "bananana"), levels = fruit)

# parse_datetime()的输入应该是一个ISO8601日期时间,这个标准下的日期组成是:年,月,日,小时,分钟,秒
parse_datetime("2019-01-01T2010")
parse_datetime("20190101")
parse_date('2010-01-01')
parse_time("01:10 am")
parse_time("20:10:01")

parse_date("01/01/15", "%m/%d/%y")
parse_date("01/02/15", "%d/%m/%y")
parse_date("2015年10月10日", "%Y年%m月%d日")
str(parse_guess("2010-01-01"))

下面的示例会出错:

R
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
challenge <- read_csv(
readr_example("challenge.csv")
)

# Parsed with column specification:
# cols(
# x = col_double(),
# y = col_logical()
# )
# Warning: 1000 parsing failures.
# row col expected actual file
# 1001 y 1/0/T/F/TRUE/FALSE 2015-01-16 '/Library/Frameworks/R.framework/Versions/3.5/Resources/library/readr/extdata/challenge.csv'
# 1002 y 1/0/T/F/TRUE/FALSE 2018-05-18 '/Library/Frameworks/R.framework/Versions/3.5/Resources/library/readr/extdata/challenge.csv'
# 1003 y 1/0/T/F/TRUE/FALSE 2015-09-05 '/Library/Frameworks/R.framework/Versions/3.5/Resources/library/readr/extdata/challenge.csv'
# 1004 y 1/0/T/F/TRUE/FALSE 2012-11-28 '/Library/Frameworks/R.framework/Versions/3.5/Resources/library/readr/extdata/challenge.csv'
# 1005 y 1/0/T/F/TRUE/FALSE 2020-01-13 '/Library/Frameworks/R.framework/Versions/3.5/Resources/library/readr/extdata/challenge.csv'
# .... ... .................. .......... ............................................................................................
# See problems(...) for more details.


problems(challenge)

修复:

R
1
2
3
4
5
6
7
8
9
challenge <- read_csv(
readr_example("challenge.csv"),
col_types = cols(
x = col_double(),
y = col_character()
)
)

tail(challenge)

错误的原因在于,日期变量被错误的读为logical变量。

还可以默认将所有列读作字符向量,然后再修改:

R
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
challenge2 <- read_csv(
readr_example('challenge.csv'),
col_types = cols(.default = col_character())
)

tail(challenge2)

# A tibble: 6 x 2
x y
<chr> <chr>
1 0.805274312151596 2019-11-21
2 0.1635163405444473 2018-03-29
3 0.47193897631950676 2014-08-04
4 0.7183186465408653 2015-08-16
5 0.26987858884967864 2020-02-04
6 0.608237189007923 2019-01-06

type_convert():自动转型

R
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
df <- tribble(
~x, ~y,
"1", "1.21",
"2", "2.32",
"3", "4.56"
)

type_convert(df)

# A tibble: 3 x 2
x y
<dbl> <dbl>
1 1 1.21
2 2 2.32
3 3 4.56

保存为文件:

R
1
2
3
4
5
6
7
8
9
10
11
12
write_csv(challenge, "challenge.csv")
write_rds(challenge, "challenge.rds")
challenge3 <- read_rds("challenge.rds")
# 比较两个对象
identical(challenge3, challenge) # TRUE

# feather格式可以方便数据的跨文件共享
library(feather)
write_feather(challenge, "challenge.feather")
challenge4 <- read_feather("challenge.feather")
# 比较两个对象
identical(challenge4, challenge) #FALSE

Tidy数据

加权计数:

R
1
2
3
4
5
6
table1 %>% 
count(year, wt = cases)
# 等价于
table1 %>%
group_by(year) %>%
summarise(n = sum(cases))

gather() & spread()

R
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
> data(table4a)
>
> table4a %>%
+ gather(`1999`, `2000`, key = "year", value = "cases")
# A tibble: 6 x 3
country year cases
<chr> <chr> <int>
1 Afghanistan 1999 745
2 Brazil 1999 37737
3 China 1999 212258
4 Afghanistan 2000 2666
5 Brazil 2000 80488
6 China 2000 213766
>
> t4a <- table4a %>%
+ gather(`1999`, `2000`, key = 'year', value = 'cases')
> t4b <- table4b %>%
+ gather(`1999`, `2000`, key = "year", value = "population")
>
> left_join(t4a, t4b)
Joining, by = c("country", "year")
# A tibble: 6 x 4
country year cases population
<chr> <chr> <int> <int>
1 Afghanistan 1999 745 19987071
2 Brazil 1999 37737 172006362
3 China 1999 212258 1272915272
4 Afghanistan 2000 2666 20595360
5 Brazil 2000 80488 174504898
6 China 2000 213766 1280428583
> table2
# A tibble: 12 x 4
country year type count
<chr> <int> <chr> <int>
1 Afghanistan 1999 cases 745
2 Afghanistan 1999 population 19987071
3 Afghanistan 2000 cases 2666
4 Afghanistan 2000 population 20595360
5 Brazil 1999 cases 37737
6 Brazil 1999 population 172006362
7 Brazil 2000 cases 80488
8 Brazil 2000 population 174504898
9 China 1999 cases 212258
10 China 1999 population 1272915272
11 China 2000 cases 213766
12 China 2000 population 1280428583
> table2 %>%
+ spread(key = type, value = count)
# A tibble: 6 x 4
country year cases population
<chr> <int> <int> <int>
1 Afghanistan 1999 745 19987071
2 Afghanistan 2000 2666 20595360
3 Brazil 1999 37737 172006362
4 Brazil 2000 80488 174504898
5 China 1999 212258 1272915272
6 China 2000 213766 1280428583

separate()和unite()

R
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
> table3
# A tibble: 6 x 3
country year rate
* <chr> <int> <chr>
1 Afghanistan 1999 745/19987071
2 Afghanistan 2000 2666/20595360
3 Brazil 1999 37737/172006362
4 Brazil 2000 80488/174504898
5 China 1999 212258/1272915272
6 China 2000 213766/1280428583
> table3 %>%
+ separate(rate, into = c("cases", "population"), sep = '/') %>%
+ type.convert()
# A tibble: 6 x 4
country year cases population
<fct> <int> <int> <int>
1 Afghanistan 1999 745 19987071
2 Afghanistan 2000 2666 20595360
3 Brazil 1999 37737 172006362
4 Brazil 2000 80488 174504898
5 China 1999 212258 1272915272
6 China 2000 213766 1280428583
> # 或者
> table3 %>%
+ separate(rate, into = c("cases", "population"), convert = T)
# A tibble: 6 x 4
country year cases population
<chr> <int> <int> <int>
1 Afghanistan 1999 745 19987071
2 Afghanistan 2000 2666 20595360
3 Brazil 1999 37737 172006362
4 Brazil 2000 80488 174504898
5 China 1999 212258 1272915272
6 China 2000 213766 1280428583

> # 也可以按照长度分割,sep参数是指分割后得到的第一个变量的长度
> table3 %>%
+ separate(year, into = c("century", "year"),
+ sep = 2)
# A tibble: 6 x 4
country century year rate
<chr> <chr> <chr> <chr>
1 Afghanistan 19 99 745/19987071
2 Afghanistan 20 00 2666/20595360
3 Brazil 19 99 37737/172006362
4 Brazil 20 00 80488/174504898
5 China 19 99 212258/1272915272
6 China 20 00 213766/1280428583

> table5 %>%
+ unite(new, century, year, sep = '') %>%
+ type_convert()
Parsed with column specification:
cols(
country = col_character(),
new = col_double(),
rate = col_character()
)
# A tibble: 6 x 3
country new rate
<chr> <dbl> <chr>
1 Afghanistan 1999 745/19987071
2 Afghanistan 2000 2666/20595360
3 Brazil 1999 37737/172006362
4 Brazil 2000 80488/174504898
5 China 1999 212258/1272915272
6 China 2000 213766/1280428583

缺失值

R
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
> (stocks <- tibble(
+ year = c(2015, 2015, 2015, 2015, 2016, 2016, 2016),
+ qtr = c( 1, 2, 3, 4, 2, 3, 4),
+ return = c(1.88, 0.59, 0.35, NA, 0.92, 0.17, 2.66)
+ ))
# A tibble: 7 x 3
year qtr return
<dbl> <dbl> <dbl>
1 2015 1 1.88
2 2015 2 0.59
3 2015 3 0.35
4 2015 4 NA
5 2016 2 0.92
6 2016 3 0.17
7 2016 4 2.66
>
> stocks %>% spread(year, return)
# A tibble: 4 x 3
qtr `2015` `2016`
<dbl> <dbl> <dbl>
1 1 1.88 NA
2 2 0.59 0.92
3 3 0.35 0.17
4 4 NA 2.66
>
> stocks %>%
+ spread(year, return) %>%
+ gather(year, return, `2015`:`2016`, na.rm = T) %>%
+ type.convert()
# A tibble: 6 x 3
qtr year return
<int> <int> <dbl>
1 1 2015 1.88
2 2 2015 0.59
3 3 2015 0.35
4 2 2016 0.92
5 3 2016 0.17
6 4 2016 2.66

complete()

R
1
2
3
4
5
6
7
8
9
10
11
12
> stocks %>% complete(year, qtr)
# A tibble: 8 x 3
year qtr return
<dbl> <dbl> <dbl>
1 2015 1 1.88
2 2015 2 0.59
3 2015 3 0.35
4 2015 4 NA
5 2016 1 NA
6 2016 2 0.92
7 2016 3 0.17
8 2016 4 2.66

fill():填补缺失值

R
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
> treatment <- tribble(
+ ~ person, ~ treatment, ~response,
+ "Derrick Whitmore", 1, 7,
+ NA, 2, 10,
+ NA, 3, 9,
+ "Katherine Burke", 1, 4
+ )

> # 使用前面一个非缺失值填补
> treatment %>%
+ fill(person)
# A tibble: 4 x 3
person treatment response
<chr> <dbl> <dbl>
1 Derrick Whitmore 1 7
2 Derrick Whitmore 2 10
3 Derrick Whitmore 3 9
4 Katherine Burke 1 4

一个数据整理的真实案例:

R
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
who1 <- who %>% 
gather(new_sp_m014:newrel_f65, key = 'key', value = 'cases', na.rm = T)
who1

who2 <- who1 %>%
mutate(
key = stringr::str_replace(key, "newrel", "new_rel")
)

who3 <- who2 %>%
separate(key, c("new", "type", "sexage"), sep = "_")

who4 <- who3 %>%
select(-new, -iso2, -iso3)

who5 <- who4 %>%
separate(sexage, c("sex", "age"), sep = 1)

# 总结起来:
who %>%
gather(key, value, new_sp_m014:newrel_f65, na.rm = T) %>%
mutate(key = stringr::str_replace(key, "newrel", "new_rel")) %>%
separate(key, c("new", "var", "sexage")) %>%
select(-new, -iso2, -iso3) %>%
separate(sexage, c("sex", "age"), sep = 1) %>%
arrange(country)
# R

评论

程振兴

程振兴 @czxa.top
截止今天,我已经在本博客上写了636.8k个字了!

Your browser is out-of-date!

Update your browser to view this website correctly. Update my browser now

×