Manipulating data frames in Python the dplyr way

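Ever since I started using the tidyverse family of R packages, I have been hooked on the tidyverse data-processing workflow. An earlier weekly-share post mentioned two similar Python packages, dplython and pandas-ply. You can download my Jupyter notebook here: 在 Python 中像 dplyr 那样操作数据框.ipynb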

The dplython package

import pandas as pd
from dplython import (DplyFrame, X, diamonds, select, sift, sample_n,
                      sample_frac, head, arrange, mutate, group_by,
                      summarize, DelayFunction)

Selecting columns and previewing the head

diamonds >> select(X.carat, X.cut, X.price) >> head(5)
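
The `>>` chaining above can be built on Python's `__rshift__` hook. The following is a minimal sketch of that idea, not dplython's actual implementation; `Frame`, `select_cols`, `first_rows`, and the toy rows are hypothetical names and data introduced only for illustration.

```python
class Frame:
    """Toy stand-in for a data frame: a list of row dicts (hypothetical)."""

    def __init__(self, rows):
        self.rows = list(rows)

    def __rshift__(self, verb):
        # `frame >> verb` hands the frame to the verb and returns its result.
        return verb(self)


def select_cols(*names):
    """Return a verb that keeps only the named columns."""
    def verb(frame):
        return Frame({n: row[n] for n in names} for row in frame.rows)
    return verb


def first_rows(n):
    """Return a verb that keeps the first n rows."""
    def verb(frame):
        return Frame(frame.rows[:n])
    return verb


toy = Frame([
    {"carat": 0.23, "cut": "Ideal", "price": 326},
    {"carat": 0.21, "cut": "Premium", "price": 326},
    {"carat": 0.23, "cut": "Good", "price": 327},
])

# Evaluated left to right: (toy >> select_cols(...)) >> first_rows(2)
result = toy >> select_cols("carat", "price") >> first_rows(2)
print(result.rows)
# [{'carat': 0.23, 'price': 326}, {'carat': 0.21, 'price': 326}]
```

Each verb returns a new `Frame`, so the original object is left untouched and the chain can be extended freely.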

Filtering rows

diamonds >> \
    sift(X.carat > 4) >> \
    select(X.carat, X.cut, X.depth, X.price)
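
An expression such as `X.carat > 4` cannot be compared right away, because no data frame is known when it is written; the comparison has to be recorded and run later. Below is a rough sketch of how such a deferred expression could work. It is an assumption-laden toy that evaluates row by row, not dplython's real `X`; the class name `Deferred`, the variable `X_toy`, and the sample rows are all hypothetical.

```python
class Deferred:
    """Hypothetical placeholder that records operations instead of running them."""

    def __init__(self, resolve=lambda row: row):
        self._resolve = resolve

    def __getattr__(self, name):
        # Only called for unknown attributes, e.g. X_toy.carat.
        if name.startswith("__"):
            raise AttributeError(name)
        return Deferred(lambda row: self._resolve(row)[name])

    def __gt__(self, other):
        # Record the comparison; nothing is compared yet.
        return Deferred(lambda row: self._resolve(row) > other)

    def evaluate(self, row):
        """Run the recorded operations against a concrete row."""
        return self._resolve(row)


X_toy = Deferred()
is_large = X_toy.carat > 4  # still a Deferred object, not a bool

rows = [{"carat": 4.5, "price": 18531}, {"carat": 0.3, "price": 400}]
kept = [row for row in rows if is_large.evaluate(row)]
print(kept)  # [{'carat': 4.5, 'price': 18531}]
```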

Sampling

(diamonds >>
    sample_n(10) >>
    arrange(X.carat) >>
    select(X.carat, X.cut, X.depth, X.price))

(diamonds >>
    sample_frac(0.0002) >>
    arrange(X.depth) >>
    select(X.carat, X.depth, X.price))
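
To get a feel for the difference between sampling a fixed count and sampling a fraction, here is a small standard-library sketch. `take_n` and `take_frac` are hypothetical helpers, not dplython functions; rounding the fraction to a whole number of rows is my assumption, and the 53,940 row count assumes the usual diamonds dataset.

```python
import random


def take_n(rows, n, seed=None):
    """Draw n rows without replacement."""
    return random.Random(seed).sample(rows, n)


def take_frac(rows, frac, seed=None):
    """Draw a fraction of the rows (rounding to a whole count is an assumption)."""
    return take_n(rows, round(len(rows) * frac), seed)


# Row ids standing in for the rows of a 53,940-row table.
rows = list(range(53940))

print(len(take_n(rows, 10)))         # 10
print(len(take_frac(rows, 0.0002)))  # 11, since 53940 * 0.0002 = 10.788
```

Under these assumptions a fraction of 0.0002 yields about as many rows as asking for 10 directly, which is why the two examples above return similarly sized tables.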

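Passing the whole data frame

X._ refers to the whole data frame flowing through the pipe, so an expression such as X._.T applies to the entire frame.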

(diamonds >>
    sample_n(5) >>
    select(X.carat, X.price, X.depth) >>
    X._.T)

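Passing data frames or columns to functions

To pass the piped data frame or its columns to your own function, decorate the function with @DelayFunction.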

@DelayFunction
def PairwiseGreater(series1, series2):
    index = series1.index
    newSeries = pd.Series([max(s1, s2) for s1, s2 in zip(series1, series2)])
    newSeries.index = index
    return newSeries

diamonds >> head(5) >> PairwiseGreater(X.x, X.y)
# 0    3.98
# 1    3.89
# 2    4.07
# 3    4.23
# 4    4.35
# dtype: float64

diamonds >> head(5) >> select(X.x, X.y)
#       x     y
# 0  3.95  3.98
# 1  3.89  3.84
# 2  4.05  4.07
# 3  4.20  4.23
# 4  4.34  4.35
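
Why is a decorator needed at all? When `PairwiseGreater(X.x, X.y)` is written, the columns do not exist yet, so the call has to be postponed until the data arrives. The sketch below shows one way a delaying decorator could be written. It is not dplython's `DelayFunction`; `Placeholder`, `delay`, and `pairwise_greater` are hypothetical names, and a plain dict of lists stands in for the data frame.

```python
import functools


class Placeholder:
    """Hypothetical stand-in for X.<column>: remembers a column name for later."""

    def __init__(self, column):
        self.column = column


def delay(func):
    """Wrap func so that calling it returns a verb waiting for the table."""
    @functools.wraps(func)
    def wrapper(*args):
        def verb(table):
            # Swap each placeholder for the real column once the table is known.
            resolved = [table[a.column] if isinstance(a, Placeholder) else a
                        for a in args]
            return func(*resolved)
        return verb
    return wrapper


@delay
def pairwise_greater(col1, col2):
    return [max(a, b) for a, b in zip(col1, col2)]


table = {"x": [3.95, 3.89, 4.05], "y": [3.98, 3.84, 4.07]}
verb = pairwise_greater(Placeholder("x"), Placeholder("y"))  # nothing runs yet
print(verb(table))  # [3.98, 3.89, 4.07]
```

The same wrapper can be applied to an existing function by calling it directly, e.g. `delayed = delay(some_function)`, instead of using the `@` syntax.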

mutate()

(diamonds >>
    mutate(carat_bin=X.carat.round()) >>
    group_by(X.cut, X.carat_bin) >>
    summarize(avg_price=X.price.mean()))
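
To make the three steps in this pipeline concrete, here is the same mutate, group, and summarize logic spelled out with the standard library only. The rows and prices are made-up toy values, not output from the notebook.

```python
from collections import defaultdict
from statistics import mean

rows = [
    {"cut": "Ideal", "carat": 0.23, "price": 326},
    {"cut": "Ideal", "carat": 0.31, "price": 500},
    {"cut": "Ideal", "carat": 1.20, "price": 6000},
    {"cut": "Good", "carat": 0.90, "price": 3000},
]

# mutate: add carat_bin, the carat rounded to the nearest integer
for row in rows:
    row["carat_bin"] = round(row["carat"])

# group_by: collect the prices for each (cut, carat_bin) pair
groups = defaultdict(list)
for row in rows:
    groups[(row["cut"], row["carat_bin"])].append(row["price"])

# summarize: reduce each group to a single average price
avg_price = {key: mean(prices) for key, prices in groups.items()}
print(avg_price)
# {('Ideal', 0): 413, ('Ideal', 1): 6000, ('Good', 1): 3000}
```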

If a column name cannot be used as an attribute, use bracket indexing instead:

diamonds["column w/ spaces"] = range(len(diamonds))
diamonds >> select(X["column w/ spaces"]) >> head()
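
A quick way to tell which column names can be written after a dot is to check whether they are valid Python identifiers and not reserved words. The helper below is a hypothetical convenience of mine, not part of dplython.

```python
import keyword


def usable_as_attribute(name):
    """True when `name` can be written after a dot, as in X.name."""
    return name.isidentifier() and not keyword.iskeyword(name)


for name in ["carat", "column w/ spaces", "class", "2nd_col"]:
    print(repr(name), usable_as_attribute(name))
# 'carat' True
# 'column w/ spaces' False
# 'class' False
# '2nd_col' False
```

Names that fail this check need the bracket form, as in the example above.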

Passing the whole data frame to ggplot for plotting

If importing the ggplot library fails, see this blog post: "Anaconda3 for Python: import fails with cannot import name 'Timestamp'"

from ggplot import ggplot, aes, geom_point, facet_wrap
ggplot = DelayFunction(ggplot)
(diamonds >> ggplot(aes(x = "carat",
                        y = "price",
                        color = "cut"),
                    data = X._) +
    geom_point() +
    facet_wrap("color"))

Passing the whole data frame to pylab for plotting

import pylab as pl
pl.scatter = DelayFunction(pl.scatter)
diamonds >> sample_frac(0.1) >> pl.scatter(X.carat, X.price)

The pandas-ply package

import pandas as pd
from pandas_ply import install_ply, X, sym_call
install_ply(pd)

ply_select()

First, read the feather data exported from R. On the R side, generate the flights.feather file with the following code:

nycflights13::flights %>% feather::write_feather("flights.feather")

import feather
flights = feather.read_dataframe('flights.feather')
flights.head(5)
(flights
    .groupby(['year', 'month', 'day'])
    .ply_select(
        arr = X.arr_delay.mean(),
        dep = X.dep_delay.mean())
    .ply_where(X.arr > 30, X.dep > 30)
    .head(5))

The same operation in plain pandas:

grouped_flights = flights.groupby(["year", "month", "day"])
output = pd.DataFrame()
output['arr'] = grouped_flights.arr_delay.mean()
output['dep'] = grouped_flights.dep_delay.mean()
filtered_output = output[(output.arr > 30) & (output.dep > 30)]
filtered_output.head(5)

# '*' selects all existing columns
flights.ply_select('*',
    gain = X.arr_delay - X.dep_delay,
    speed = X.distance / X.air_time * 60).head(5)
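
The two derived columns are simple per-row arithmetic. In nycflights13, air_time is recorded in minutes and distance in miles, so multiplying by 60 converts miles per minute into miles per hour. A worked example for a single flight, using illustrative values that I am assuming rather than copying from the notebook output:

```python
arr_delay = 11   # arrival delay in minutes
dep_delay = 2    # departure delay in minutes
distance = 1400  # miles
air_time = 227   # minutes in the air

gain = arr_delay - dep_delay      # difference between the two delays
speed = distance / air_time * 60  # miles per minute times 60 = miles per hour

print(gain, round(speed, 1))  # 9 370.0
```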

ply_where()

flights.ply_where(X.month == 1, X.day == 1).head(5)
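
Judging from the pandas equivalent shown earlier, where the two conditions are joined with `&`, multiple conditions passed to ply_where() must all hold. A minimal standard-library sketch of that behaviour, with `where_all` and the toy rows as hypothetical stand-ins:

```python
def where_all(rows, *predicates):
    """Keep rows for which every predicate is true (conditions are ANDed)."""
    return [row for row in rows if all(p(row) for p in predicates)]


rows = [
    {"month": 1, "day": 1, "flight": 1545},
    {"month": 1, "day": 2, "flight": 1714},
    {"month": 2, "day": 1, "flight": 1141},
]

kept = where_all(rows, lambda r: r["month"] == 1, lambda r: r["day"] == 1)
print(kept)  # [{'month': 1, 'day': 1, 'flight': 1545}]
```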